Research ArticlePreeclampsia

Circulating transcripts in maternal blood reflect a molecular signature of early-onset preeclampsia

See allHide authors and affiliations

Science Translational Medicine  01 Jul 2020:
Vol. 12, Issue 550, eaaz0131
DOI: 10.1126/scitranslmed.aaz0131

Pinpointing preeclampsia

Preeclampsia is a poorly understood pregnancy-associated disorder. Often marked by hypertension, preeclampsia can lead to severe complications including internal bleeding, seizures, stroke, premature birth, and death. The condition can develop suddenly, and identifying who is at risk of preeclampsia is difficult. Munchel et al. examined circulating RNA in pregnant women who developed early-onset preeclampsia and found 49 transcripts of maternal, placental, and fetal origin that classified a small but independent cohort of pregnant women with early-onset preeclampsia. These transcripts may reveal preeclampsia biology and, if validated in a larger cohort, may enable much-needed preeclampsia diagnostics.


Circulating RNA (C-RNA) is continually released into the bloodstream from tissues throughout the body, offering an opportunity to noninvasively monitor all aspects of pregnancy health from conception to birth. We asked whether C-RNA analysis could robustly detect aberrations in patients diagnosed with preeclampsia (PE), a prevalent and potentially fatal pregnancy complication. As an initial examination, we sequenced the circulating transcriptome from 40 pregnancies at the time of severe, early-onset PE diagnosis and 73 gestational age–matched controls. Differential expression analysis identified 30 transcripts with gene ontology annotations and tissue expression patterns consistent with the placental dysfunction, impaired fetal development, and maternal immune and cardiovascular system dysregulation characteristic of PE. Furthermore, machine learning identified combinations of 49 C-RNA transcripts that classified an independent cohort of patients (early-onset PE, n = 12; control, n = 12) with 85 to 89% accuracy. C-RNA may thus hold promise for improving the diagnosis and identification of at-risk pregnancies.


Preeclampsia (PE) is one of the most common and serious complications of pregnancy, affecting an estimated 4 to 5% of pregnancies worldwide (1, 2), and is associated with substantial maternal and perinatal morbidity and mortality (3, 4). In the United States, the incidence of PE is increasing because of advancing maternal age and the increasing prevalence of comorbid conditions such as obesity (5), costing the U.S. healthcare system an estimated 2 billion dollars annually (6).

PE is diagnosed as new-onset hypertension accompanied by maternal end-organ damage occurring after 20 weeks’ gestation (7, 8). However, there is substantial heterogeneity in the presentation and progression of PE, including timing of disease onset, symptom severity, clinical manifestations, and maternal and neonatal outcomes (9). PE is primarily delineated on the basis of whether it manifests before (early onset) or after (late onset) 34 weeks and whether it presents with severe features, such as sustained elevation in blood pressure ≥160/110 mmHg, neurological symptoms, or severe liver or kidney injury (10).

The pathophysiology of early-onset PE is incompletely understood but is thought to occur in two phases (11). Early-onset PE originates with abnormal implantation and placentation in the first trimester, related to maternal immune dysfunction (1214), incomplete cytotrophoblast differentiation (15), or oxidative stress at the maternal-placental interface (16), resulting in incomplete remodeling of the maternal spiral arteries and failure to establish the definitive uteroplacental circulation (17). This leads to inadequate placental perfusion after 20 weeks’ gestation. The resultant placental dysregulation triggers phase 2, which manifests predominantly as maternal systemic vascular dysfunction with negative consequences for the fetus, including fetal growth restriction and iatrogenic preterm birth (1820). In contrast, the placental dysfunction in the late-onset PE is thought to be due not to abnormal placentation, but to disturbances in placental perfusion resulting from maternal vascular disease, such as that seen in patients with chronic hypertension, pregestational diabetes (21), and collagen vascular disorders (22).

The heterogeneity and complexity of this disease have made it difficult to diagnose, to predict risk, and to develop treatments. Furthermore, the inability to easily interrogate the primary affected organ, the placenta, has limited molecular characterization of disease progression. Circulating RNA (C-RNA) has shown great promise for noninvasive monitoring of maternal, placental, and fetal dynamics during pregnancy (23, 24). C-RNA is released by many tissues into the bloodstream via multiple cellular processes of apoptosis, microvesicle shedding, and exosomal signaling (25). Because of these diverse origins, C-RNA measurements reflect tissue-specific changes in gene expression, intercellular signaling, and the degree of cell death occurring within different tissues throughout the body. Thus, C-RNA has the potential to elucidate the molecular underpinnings of PE and ultimately identify predictive, prognostic, and diagnostic biomarkers of the disease (26).

Several studies have begun to investigate and identify potential C-RNA–based markers for a range of pregnancy complications, including preterm birth, PE, and infectious disease (2729). However, the interindividual variability in this sample type threatens to obscure subtle changes in disease-specific markers (30). Therefore, many PE-focused C-RNA studies have measured RNAs of previously identified serum protein biomarkers, including soluble fms-like tyrosine kinase-1 (FLT1), soluble endoglin, and oxidative stress and angiogenic markers (3133). Although protein measurements of these serum markers are known to be altered in PE (3436), it is unknown whether these genes will serve as the most effective predictors in C-RNA, thus warranting a broader discovery approach. In this study, we hypothesized that global measurements of the circulating transcriptome could detect unique molecular signatures specific to early-onset severe PE.


Establishing a reproducible whole-transcriptome workflow for C-RNA

C-RNA is present in plasma in relatively low abundance, dominated by ribosomal RNA (rRNA) and globin RNA, and is a mixture of fragmented and full-length transcripts (37), all of which can affect the efficiency of library preparation methods for next-generation sequencing. Thus, we optimized our workflow to minimize variability in transcript abundance measurements and maximize exonic C-RNA signals (fig. S1A).

Highly abundant globin and rRNA do not inform biomarker discovery and must be removed. However, standard depletion methods such as Ribo-Zero (Illumina Inc) or NEBNext rRNA Depletion (New England Biolabs) are not well suited for low starting amounts of RNA (38); although upfront depletion with these methods successfully removed 90% of unwanted ribosomal sequences from C-RNA (P < 0.001 versus no pretreatment), sequencing libraries did not consistently exhibit a resulting increase in exonic C-RNA signal (fig. S1B). Instead, samples showed a fivefold increase in the proportion of reads mapping to a complex and variable population of nonhuman RNA sequences such as GB Virus C (P < 0.001 versus no pretreatment), consistent with previous reports (39, 40). In addition, removal of highly abundant rRNA and globin RNA results in extremely low RNA inputs, which increases the failure rate of ligation-based library preparation methods. To avoid these problems, we generated a library from all C-RNA, followed by probe-assisted enrichment targeting the whole human exome (fig. S1A). This method consistently generated high-quality sequencing libraries that were composed of 87 ± 3% exonic C-RNA (P < 0.001 versus no pretreatment and rRNA depletion) with minimal contaminating signal from transcripts of limited interest (P < 0.001 versus no pretreatment for rRNA; P < 0.001 versus rRNA depletion for unaligned reads; fig. S1B).

There was considerable interindividual variability in C-RNA plasma concentrations, which varied by an order of magnitude (1.1 ± 0.7 ng/ml; range, <0.1 to 5 ng/ml) (fig. S2A). To ensure reproducible results, we evaluated the effects of plasma volume input on C-RNA data quality. Using less than 2-ml plasma significantly increased the biological coefficient of variation (P ≤ 0.02) and decreased library complexity (P ≤ 0.001; fig. S2, B and C), leading to a decrease in sensitivity. We therefore selected a 4-ml plasma input to minimize noise, maximize confidence in data quality, and ensure successful data generation from all individuals regardless of C-RNA plasma concentrations.

To achieve consistent handling of all samples across diverse collection sites, all processing was centralized and required shipping of the collected blood samples to a single laboratory. This necessitated an assessment of the impact of overnight shipping on C-RNA data quality. Blood from nonpregnant and pregnant women [gestational age (GA), >28 weeks] was collected in four types of blood collection tubes (BCTs) and stored overnight at the manufacturer-indicated temperature before processing. Control samples were processed at the clinic within 2 hours to provide a baseline. The C-RNA pregnancy signal in each sample was measured by summing the normalized abundance of 155 transcripts identified as pregnancy markers in two prior C-RNA publications (23, 24). After overnight storage, the C-RNA pregnancy signal was clearly detectable in pregnant samples despite a reduction in overall signal intensity as compared to immediate processing of the samples (fig. S3, A and B). No significant differences in pregnancy transcript abundance were observed between the different samples after overnight storage, indicating that all are appropriate for C-RNA analysis. After selecting Cell-Free DNA collection tubes for all subsequent sample collections, we performed a time course experiment and confirmed that there was no increase in technical variance after room temperature storage for up to 5 days (fig. S3C).

The complete workflow was validated by recapitulating previous work monitoring C-RNA dynamics in healthy pregnancies from first to third trimester (23, 24). Using 152 samples collected serially from 41 healthy pregnancies (Pre-Eclampsia and Growth Restriction Longitudinal Study Healthy Control Cohort—PEARL HCC, NCT02379832; table S1), we identified 156 differentially regulated transcripts, with the majority increasing in abundance as pregnancy progressed (Fig. 1A and table S2). Fifty-one percent of identified transcripts changed primarily during the first trimester, 6% in the third trimester, and 43% were differentially regulated throughout gestation (fig. S4 and table S2). First trimester genes were enriched for placental steroidogenesis and regulation of trophoblast differentiation, whereas third trimester genes were involved in the onset of labor. Transcripts that increase throughout gestation are associated with tissue and organ development and morphogenesis (4144). The results from the PEARL HCC were concordant with the literature, as 42% of the altered genes were identified in prior C-RNA studies (Fig. 1B) (23, 24). Of the 91 transcripts identified only in this study, 64% were expressed by placental tissues, fetal tissues, or both (Fig. 1C and fig. S5), where tissue specificity was determined from Body Atlas references (Correlation Engine, Illumina Inc.) (45). We hypothesize that the remaining genes reflect maternal tissue responses to pregnancy (table S2).

Fig. 1 Changes in the C-RNA transcriptome track with pregnancy progression.

(A) Temporal changes in transcripts altered throughout the course of pregnancy. Each row corresponds to a transcript, the abundance of which was normalized across all samples (n = 152) before clustering. Orange signifies elevated abundance; purple indicates decreased abundance. (B) Overlap of transcripts identified in three independent C-RNA pregnancy progression analyses (Koh, n = 33; Tsui, n = 4; PEARL HCC, n = 152). (C) Tissues expressing the 91 genes that were only detected in C-RNA from the PEARL HCC cohort.

Clinical study design for early-onset severe PE

After confirming that this workflow robustly detected pregnancy-related C-RNA dynamics, we sought to identify changes in C-RNA associated with pregnancy complications. We therefore applied our workflow to samples collected from two independent PE cohorts (PEC), the Illumina PEC (iPEC; NCT02808494) and the PEARL PEC (NCT02379832). The iPEC was used for biomarker identification, whereas the PEARL PEC was used for independent confirmation of our findings. All samples in both cohorts were processed, and libraries were generated in the same manner as discussed in the previous section.

The iPEC study focused on early-onset PE with severe features and excluded women diagnosed with additional health complications such as chronic hypertension or diabetes to prevent additional heterogeneity from obscuring a consistent PE-associated C-RNA signal (table S3). We collected 113 samples across 8 sites (table S5), 40 at the time of early-onset PE diagnosis, and 73 controls that were GA-matched to within 1 week (Fig. 2A). Maternal characteristics, pregnancy outcomes, and medications were recorded throughout the study (Table 1 and table S4). Fetal sex, maternal age, and gravida were not significantly different between the PE and control groups (P > 0.58; Fisher’s exact test for fetal sex, t test for maternal age, and Mann-Whitney U test for gravida). In contrast, body mass index (BMI) was significantly higher in the PE cohort (P = 0.0004 by Mann-Whitney U test; control, 30.1 ± 5.6; PE, 34.2 ± 5.8) (46). All but one patient with PE gave birth prematurely, in contrast to 9.5% of controls, confirming that our diagnostic criteria identified individuals severely affected by this disease (Fig. 2C).

Fig. 2 Sample collection, but not clinical outcome, is matched for control and PE samples.

Panels illustrate the time of blood collection (triangles) and GA at birth (squares) for each individual in (A) the iPEC study and (B) the PEARL PEC study. The orange line indicates the threshold for term birth at 37 weeks. (C) Preterm birth rates in PE cohorts. ***P < 0.001 by Fisher’s exact test. iPEC: control, n = 73; PE, n = 40; PEARL PEC: n = 12 for each group.

Table 1 Characteristics for the iPEC.

View this table:

The PEARL PEC samples were collected by an independent institution [Centre Hospitalier Universitaire (CHU) de Québec–Université Laval] and consisted of 12 early- and 12 late-onset PE pregnancies with equal numbers of GA-matched controls (Fig. 2B). Maternal characteristics, pregnancy outcomes, and medications in use were recorded throughout the study (table S6). Similarly to iPEC, 100% of early-onset patients in PEARL PEC delivered prematurely, whereas 75% delivered at term in the late-onset cohort, confirming the differences in severity associated with early- and late-onset PE. Chronic hypertension, diabetes, and other maternal health conditions were not grounds for exclusion, making this cohort more representative of the heterogeneity inherent to the pregnant population.

Identification of transcripts consistently altered in early-onset PE

Standard differential expression analysis (47) using the full iPEC cohort identified 42 transcripts with altered abundance in plasma, 37 of which were increased in PE (Fig. 3A). However, we observed variability in the differentially abundant transcripts when different subsets of samples were selected for analysis. We therefore incorporated a jackknifing approach (48) to identify transcript abundances that were most consistently altered when comparing PE to control samples (Fig. 3, A and B). We performed 1000 iterations of differential analysis with randomly selected PE and control sample subsets, resulting in the construction of confidence intervals for the P values associated with each putatively altered transcript (Fig. 3C). Twelve transcripts whose P value confidence interval exceeded 0.05 were subsequently excluded (Fig. 3B). These transcripts would not have been excluded by simply setting a threshold for baseline abundance or biological variance (Fig. 3D); however, these transcripts were observed to have lower predictive value (Fig. 3E). Hierarchical clustering indicated that these transcripts were not universally altered in the PE cohort and thus lacked sensitivity (73%) for accurate classification of this condition (Fig. 3F).

Fig. 3 Applying jackknifing to differential expression analysis excludes genes with lower sensitivity for identifying PE samples.

(A) Fold change and abundance of transcripts altered in PE. CI, confidence interval. (B) One-sided, normal-based 95% confidence intervals of the P value for each transcript detected as altered by differential expression analyses. (C) Schematic of the jackknifing approach, wherein 90% of samples were randomly selected for analysis over many iterations to quantify P value stability. Pink squares and teal triangles represent samples from PE and control groups, respectively. (D) Average abundance and noise for each differentially abundant gene (P > 0.22 for both variables by Mann-Whitney U test). (E) ROC area under the curve (AUC) values for each differentially abundant transcript reflect how separated C-RNA transcript abundance distributions are for control versus PE samples (*P < 0.05 by Mann-Whitney U test). For (D) and (E), Included, n = 30; Excluded, n = 12. (F) Hierarchical clustering of iPEC samples using the genes excluded after jackknifing (sensitivity = 73%; specificity = 99%; n = 113). (A, B, D, and E), orange and blue data points reflect transcripts statistically altered in PE by standard differential expression analysis, but only orange data points were identified by the jackknifing approach. Sample status is shown by blue (PE) and gray (control) rectangles along the right side of the heatmap.

We independently quantified a representative set of 20 transcripts altered in PE by quantitative polymerase chain reaction (qPCR) in a subset of affected and control iPEC patient samples. Fold changes measured by qPCR were highly concordant with the sequencing data, validating our findings (Fig. 4A and table S7). Fifty-eight percent of the transcripts in the refined list have previously been associated with PE (table S8). In addition, nearly all genes could be linked to PE-relevant processes, including extracellular matrix (ECM) remodeling, pregnancy duration, placental/fetal development, angiogenesis, and hypoxia response (table S8). Sixty-seven percent of these genes were expressed by the placenta, fetus, or both (Fig. 4B). Cardiovascular and immune functions, both of which are altered in PE, were well represented in the remaining maternally expressed genes (table S8) (11).

Fig. 4 Ubiquitously altered C-RNA transcripts segregate early-onset PE samples from controls.

(A) The fold change between PE and control pregnancies from iPEC was assessed both by sequencing (orange) and by qPCR (purple) for 20 transcripts (**P < 0.01 and ***P < 0.0001 by Mann-Whitney U test; n = 19 for control and PE). (B) Tissue expression of all differentially abundant genes identified in iPEC. (C) Hierarchical clustering of the iPEC samples (average linkage, squared Euclidean distance; PE, n = 40; control, n = 73). Clustering of early-onset (D) and late-onset (E) PE and control pregnancy samples from the PEARL PEC (n = 12 for each group).

Hierarchical clustering of these transcripts effectively segregated PE and control samples from iPEC with 98% sensitivity and 97% specificity (Fig. 4C). We next sought to validate the refined list of 30 transcripts in the independent PEARL PEC cohort. Early-onset PE samples clustered separately from matched controls with 83% sensitivity and 92% specificity, further validating the relevance of these transcripts (Fig. 4D). In contrast, no clustering was observed for the late-onset PE and matched control samples, indicating that late-onset PE had a potentially weaker or a different C-RNA signature (Fig. 4E) (49, 50).

Upon closer inspection of the clinical data for the three misclustered iPEC samples, we found that two controls suffered from potentially confounding health problems, including hypertension and, in one case, accompanied by preterm delivery. These controls should not have been enrolled in the iPEC due to our stringent exclusion criteria; thus, they were excluded from further analyses. The misclustered PE sample showed no clinical abnormalities and was retained in our iPEC dataset.

Development of a robust machine learning classifier for early-onset PE

Differential expression analysis confirmed that C-RNA detected biologically relevant changes in patients with PE. To assess whether C-RNA signatures could robustly classify PE, we used the data from the iPEC cohort to build an AdaBoost model (5153). A randomly selected 10% of samples were excluded as a holdout set from the training process to assess final model performance. We used a nested cross-validation approach for hyperparameter optimization (fig. S6) and AdaBoost model building (fig. S7) (54).

While developing our machine learning approach, we observed a high degree of variability in AdaBoost performance and in the genes selected depending on which samples were included in the training set (fig. S8). These observations indicated that different subsets of samples substantially affected model construction (55), which we suspect was, in part, due to the heterogeneity of PE. To account for this diversity, we fit AdaBoost models to multiple combinations of the training samples. The estimators from these orthogonally generated models were then combined into a single aggregate and pruned to obtain a minimal gene set (fig. S7) (56, 57). This allowed us to capture the wide diversity of PE manifestations in a refined machine learning model with the potential to accurately classify independent samples from a broad pregnancy population.

Within each fold of the cross-validation, training samples were used to identify a threshold AdaBoost score for discriminating PE and control samples, which maximized both sensitivity and specificity. Across all 10 folds, we obtained an average area under the receiver operating characteristic (ROC) curve (AUROC) of 0.964 ± 0.068 (Fig. 5A). We first assessed performance in the holdout iPEC samples, obtaining 89 ± 5% accuracy with 88 ± 13% sensitivity and 92 ± 6% specificity (Fig. 5B). AdaBoost classification performance was not affected by the amount of time before plasma processing, further supporting the robustness of our sample preparation protocol and analyses (fig. S4, D and E). We subsequently investigated the model’s ability to classify the independent PEARL PEC cohort. Applying the classifier to early-onset PEARL PEC samples achieved 85 ± 4% accuracy, with 77 ± 9% sensitivity and 92 ± 7% specificity (Fig. 5B). Late-onset PEARL PEC samples were also classified with a reasonable accuracy of 72 ± 6%, sensitivity of 59 ± 10%, and specificity of 80 ± 10%.

Fig. 5 Machine learning accurately classifies PE in an independent cohort.

(A) Average ROC curve for iPEC validation samples (dashed line = SD; N = 10). (B) Accuracy, sensitivity, and specificity measurements of iPEC hold out samples and independent PEARL PEC samples (N = 10). (C) Heatmap of the relative transcript abundance in the iPEC cohort for the genes used by AdaBoost model. The graph on the right indicates how many cross-validation models a given transcript appeared in. (D) Concordance of transcripts identified by differential expression analysis (DEX) and AdaBoost. (E) Tissues expressing transcripts selected by AdaBoost models.

Forty-nine total transcripts were used by AdaBoost, with 63% selected in at least two rounds of model building (Fig. 5C). We observed concordance with prior analyses, as 40% of the genes identified in the jackknifing analysis were also used in our machine learning classifier (Fig. 5D and table S9). In addition, 38% of the transcripts used by the classifiers had elevated expression in the placenta or fetus (Fig. 5E). AdaBoost models incorporated transcripts reflecting a diversity of PE-relevant pathways, particularly genes associated with immune regulation and fetal development (table S9).


Whole-transcriptome C-RNA analysis casts the wide net necessary for effective biomarker discovery, capturing a molecular snapshot of the diverse and complex interactions of pregnancy at a single point in time. We tailored our workflow and analyses to minimize technical noise, obtain high-quality C-RNA measurements, and ultimately detect biologically relevant alterations. We detected molecular changes specific to the complex pathophysiology of early-onset severe PE at the time of diagnosis, supporting robust classification across cohorts. The altered C-RNA transcripts identified represented contributions from maternal, placental, and fetal tissues, many of which would not be captured in studies focusing on placental tissues collected after delivery. These discoveries highlight the power of C-RNA to comprehensively monitor signals contributed by diverse tissues of origin while the pregnancy is ongoing.

To identify the best method capable of detecting global and potentially subtle changes in pregnancy, we examined the effects of plasma input, library preparation methods, and BCTs on C-RNA data quality. As the majority of transcripts appeared to be present in plasma at low abundance, we chose to use 4-ml plasma inputs in our protocol to minimize noise due to sampling error and poor library conversion, both of which plague low-input sequencing applications. Upfront depletion of abundant RNA did not eliminate all contaminating RNA species, which were numerous and diverse in our sample population. This made targeted depletion infeasible; thus, we selected a whole-transcriptome enrichment approach to consistently isolate the exonic C-RNA signal of interest. Overnight shipping was a logistical requirement of our protocol but runs the risk of introducing both signal loss due to C-RNA degradation and contamination with additional RNA from cell lysis. We selected Streck Cell-Free DNA BCTs, which inhibit cell lysis at room temperature (58). This collection method is not specifically designed for RNA stabilization, but we saw little evidence of C-RNA degradation after storage for several days. Previous studies have shown that C-RNA has sufficient endogenous protection from extracellular nucleases (59), and thus, additional precautions to protect the RNA are unnecessary. Together, these optimizations identified a workflow that maximized C-RNA transcriptomic signal and minimizes technical variability, as illustrated by the numerous biologically relevant alterations observed across healthy and PE pregnancies.

We focused our analyses on identifying differences in circulating transcriptomes that are ubiquitous to the most extreme phenotype of the disorder, namely early-onset PE with severe features. This required tailoring our approach to account for the variability observed in our data that stems from both the substantial biological noise in C-RNA measurements and the phenotypic diversity of PE. C-RNA is inherently more variable than single tissue transcriptomics, because it interrogates RNA from diverse tissues and biological processes, detecting not only changes in gene expression but also differences in the rates of cell death and intercellular signaling. Furthermore, PE exhibits a wide range of maternal and fetal manifestations and outcomes, which may be associated with different underlying molecular causes and responses. Although the genes we eliminated after jackknifing may be biologically relevant in PE, they were not universally altered in our affected cohort. These transcripts may indicate a molecular subset of the disease, and larger cohorts will help elucidate whether C-RNA can further delineate PE subtypes, which is crucial to understanding the diverse pathophysiology of this syndrome.

The transcripts identified by jackknifing represent a diversity of functions spanning the maternal-fetal interface. A majority of the identified changes relate to placental dysfunction and altered fetal development. One of the most notable trends was an increased abundance of ECM remodeling and cell migration proteins (n = 10), tracking with dysfunctional extravillous trophoblast invasion characteristic of early-onset PE (6062). Twenty percent of dysregulated genes identified encode for angiogenic proteins, consistent with a number of observations that the balance of angiogenic factors plays a crucial role in regulating placental vascular development (63) and can identify early-onset PE with severe features (64). Our data indicated that fetal growth and development were also perturbed in early-onset severe PE, as evidenced by increased abundance of four transcripts encoding regulators of insulin-like growth factor (IGF) signaling (65, 66), a critical pathway for fetal development (67). The remaining transcripts captured the maternal component of PE, namely immune and cardiovascular system dysregulation. Evidence of maternal immune imbalance, a hallmark of PE, appeared as altered abundance of immunological tolerance and pro- and anti-inflammatory factors (6871). Transcripts important to blood pressure regulation as well as several genes linked to atherosclerosis were also identified as altered in PE C-RNA profiles, consistent with maternal vascular disease as an underlying mechanism predisposing some patients to PE (72, 73). The transcripts identified captured a diversity of PE-relevant functions and highlight the ability of C-RNA to simultaneously monitor the numerous molecular processes implicated in complex disease.

We next sought to determine whether C-RNA could not only detect biologically relevant changes but also accurately classify pregnancies affected by early-onset severe PE. Our approach to AdaBoost model building identified combinations of transcripts that can classify across distinct patient subsets while excluding features that could lead to overfitting and bias in our model. Although 76% of transcripts used by AdaBoost were not identified as differentially abundant, they still reflected the same PE-relevant pathways that were captured in our jackknifing analysis. The success of this strategy was illustrated by the accurate classification of the independent early-onset PEARL PEC. PEARL PEC samples were collected at the time of diagnosis from a different population than the one used for training, with less stringent inclusion and exclusion criteria. For example, this cohort included five women who had chronic hypertension or gestational diabetes, only one of which was misclassified by a few AdaBoost models, indicating the C-RNA changes used in machine learning were specific to PE.

Our work offers an important step toward improved understanding and diagnosis of PE; however, it is limited by several factors. This is a proof-of-concept study, and the limited size of our clinical cohorts does not capture the phenotypic diversity of the global pregnant population. We believe that this is reflected in the lower values obtained for sensitivity than specificity in our classification analyses. Larger cohorts will better encompass the heterogeneity of PE, identifying signals for diverse manifestations of disease. A second limitation is the targeted nature of whole-transcriptome enrichment, which does not capture the full range of noncoding or nonhuman transcripts present in plasma. Certain infections and microRNAs (miRNAs) have been associated with PE (74, 75). Thus, future studies should aim to incorporate these measurements with C-RNA transcriptomic data. Although our findings are highly consistent with what is reported in the literature, we did not observe alterations in transcripts of widely reported serum protein biomarkers such as soluble FLT1 (sFLT), vascular endothelial growth factor (VEGF), or placental growth factor (PlGF) (11). This result may be due to the fact that gene expression is not always correlated with protein abundance or release into the circulation.

We focused the iPEC sample collection specifically on early-onset PE with severe features at the time of diagnosis. This is the most extreme phenotype of the disease; however, it represents a small percent of PE cases, and further in-depth exploration across the clinical PE spectrum is warranted. As a preliminary examination, we applied our AdaBoost model to the late-onset PEARL PEC cohort and achieved reasonably good accuracy (72%). This is unexpected, given the substantial evidence that early- and late-onset PE are distinct conditions, despite sharing a final common phenotype of placental dysfunction (76). Thus, some of the transcripts identified by AdaBoost likely reflect the response to uteroplacental insufficiency rather than its source and therefore may not have predictive value early in disease progression. In contrast, we also observed alterations in transcripts involved in angiogenesis and trophoblast invasion (7779), known molecular drivers of PE initiation, and we speculate that these may have predictive value early in disease progression. Regardless of whether the changes detected represent cause or effect, or even a combination thereof, our methods and findings show that C-RNA provides a unique opportunity to build robust diagnostic algorithms and investigate mechanisms of disease that were never considered before.

Our successful classification of patients with PE at the time of diagnosis showcases how C-RNA profiles can be used to robustly monitor maternal, fetal, and placental functions in real time. Future studies should focus earlier in pregnancy to evaluate the potential of this approach to improve prognostication and prediction of outcomes for women with PE. Such studies hold great promise for uncovering predictive biomarkers for early stratification of all at-risk pregnancies, informing prophylactic interventions, or more vigilant monitoring of pregnancy. We believe that the application of C-RNA will ultimately provide comprehensive molecular monitoring of maternal and fetal health throughout the course of pregnancy.


Study design

The objective of this study was to determine whether C-RNA can detect molecular markers associated with early-onset PE with severe features at the time of diagnosis. This goal was achieved by optimizing a protocol to obtain robust whole-transcriptome C-RNA measurements, analyzing blood plasma C-RNA profiles from 40 patients at the time of PE diagnosis and 76 GA-matched control pregnancies, and validating our findings with C-RNA data generated from an independent cohort. We followed the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) reporting guidelines for diagnostic studies. Clinical protocol and informed consent forms for the iPEC study were approved by each clinical site’s Institutional Review Board; inclusion and exclusion criteria are specified in table S3 and described in detail below. Sample sizes were determined by the number of patients with PE we expected to enroll over a 2-year period, based on the population prevalence of early-onset PE with severe features. We obtained up to two GA-matched controls per PE case. Patient samples were excluded only on the basis of clinical information that did not align with our exclusion criteria. Patient samples were randomly assigned to either training or validation groups for predictive modeling. The number of replicates is indicated in the figure legends. Investigators were blinded to sample status through the bioinformatic processing of sequencing data. Additional method details are available in the Supplementary Materials. Raw data are reported in data files S1 to S4.

Clinical sample collection

Illumina Preeclampsia Cohort. Pregnant patients were recruited in an Illumina-sponsored clinical study protocol (NCT02808494) in compliance with the International Conference on Harmonization for Good Clinical Practice. Participants were recruited across eight different clinical sites: University of Texas Medical Branch (Galveston, Texas), Tufts Medical Center (Boston, MA), Columbia University Irving Medical Center (New York, NY), Winthrop University Hospital (Mineola, NY), St. Peter’s University Hospital (New Brunswick, NJ), Christiana Care (Newark, DE), Rutgers University Robert Wood Johnson Medical School (New Brunswick, NJ), and New York Presbyterian/Queens (New York, NY).

After obtaining informed consent, 20-ml whole blood samples were collected from 40 singleton pregnancies with a diagnosis of PE before 34 weeks’ gestation with severe features defined per American College of Obstetricians and Gynecologists (ACOG) guidelines (table S3) (7). Samples from 76 healthy pregnancies were also collected and were matched for GA to the PE group. Three control samples developed term PE after blood collection and were excluded from the study. Maternal characteristics, birth outcomes (Table 1), and medications (table S4) in use were all recorded during the study.

Pre-Eclampsia and Growth Restriction Longitudinal Study. Plasma samples from the PEARL study (NCT02379832) were used as an independent validation cohort. PEARL samples were collected at the CHU de Québec by principal investigator E. Bujold. Only participants above 18 years of age were eligible, and all pregnancies were singleton. A group of 45 control pregnancies (PEARL HCC) and 45 case pregnancies (PEARL PEC) were recruited in this study, and written informed consent was obtained for all patients. A selection of plasma samples was obtained after the study had reached completion.

The criteria for PE was defined based on the Society of Obstetricians and Gynecologists of Canada (SOGC) June 2014 criteria for PE, with a GA requirement between 20 and 41 weeks, encompassing both early-onset (diagnosed <34 weeks; n = 12) and late-onset (diagnosed >34 weeks, n = 12) PE. A blood sample was taken once at the time of diagnosis from the PEARL PEC samples. The PEARL HCC included 45 pregnant women who were expected to have a normal pregnancy and were recruited between 11 and 13 weeks’ gestation. Each enrolled patient was followed longitudinally with blood drawn at four time points throughout pregnancy until birth. The control women were divided into three subgroups, and subsequent follow-up blood draws were staggered to cover the entire range of GAs throughout pregnancy (table S1). In addition to using the PEARL HCC samples to assess C-RNA in healthy pregnancy, samples from 24 unique individuals in the PEARL HCC were selected to serve as GA-matched controls for both the early- and late-onset PE cohorts.

Sample preparation

Plasma processing. All samples from the iPEC and the PEARL cohorts were processed in randomized batches by investigators blinded to disease status. Two tubes of blood were collected per patient in Cell-Free DNA BCTs (Streck Inc.). Blood samples collected in iPEC were stored and shipped at room temperature overnight and processed within 120 hours. PEARL blood samples were collected, processed into plasma within 24 hours, and stored at −80°C until shipped to Illumina on dry ice. All blood was centrifuged at 1600g for 20 min at room temperature, and plasma was transferred to a new tube and centrifuged at 16,000g for additional 10 min. The plasma supernatant was stored at −80°C until use.

Sequencing library preparation. C-RNA was extracted from 4.5 ml of plasma with the Circulating Nucleic Acid Kit (Qiagen) followed by DNAse I digestion (Thermo Fisher Scientific) according to the manufacturer’s instructions. C-RNA was fragmented at 94°C for 8 min, followed by random hexamer primed cDNA synthesis using the Illumina TruSight Tumor 170 Library Preparation kit (TST170; Illumina Inc.). Illumina sequencing library preparation was carried out according to the TST170 kit for RNA, with two modifications: All reactions were reduced to 25% of the original volume, and the ligation adaptor was used at a 1 in 10 dilution. Library quality was assessed with the High Sensitivity DNA Analysis chips on the Agilent Bioanalyzer 2100 (Agilent Technologies).

Whole-transcriptome enrichment. Sequencing libraries were quantified with Quant-iT PicoGreen dsDNA Kit (Thermo Fisher Scientific), normalized to 200-ng input and four samples pooled per enrichment reaction. The Illumina TruSeq RNA Enrichment kit (Illumina Inc.) was used to carry out whole-exome enrichment. Briefly, biotinylated oligos targeting the exome were hybridized to sequencing libraries and pulled down by magnetic streptavidin beads to enrich libraries for exonic RNA. This process was performed two times to maximize exonic enrichment. The final enriched library was then reamplified by PCR to provide sufficient yield for sequencing. Blocking oligos lacking the 5′ biotin designed against hemoglobin genes HBA1, HBA2, and HBB were included in the enrichment reaction to minimize contributions from highly abundant hemoglobin. Final enrichment libraries were quantified using Quant-IT Picogreen dsDNA Kit (Thermo Fisher Scientific), normalized, and pooled for 50 base-pair paired-end sequencing on Illumina HiSeq 2000 platforms to a minimum depth of 40 million reads per sample.

Sequencing data analysis

Bioinformatic processing of sequencing data. Fastq files containing more than 50 million reads were downsampled to 50 million reads with seqtk (v1.2-r102-dirty). Sequencing reads were mapped to human reference genome (hg19) with TopHat2 (v2.0.13) (71), and transcript abundance was quantified with featureCounts (subread-1.4.6) (72) against RefGene coordinates (obtained 27 October 2014). Aggregate tissue expression datasets obtained from Body Atlas (Correlation Engine, Illumina Inc.) were used to categorize transcript tissue contribution (45); genes with expression ≥2-fold higher in the placenta or a fetal tissue (brain, liver, lung, and thyroid) versus the median expression across all tissues were assigned to that group. Subcellular localization was obtained from UniProt (73). Functional enrichment analyses were performed with g:Profiler (v e97_eg44_p13_d22abce) (80).

Differential expression analysis. Differential expression analysis was performed in R (v3.4.2) with edgeR (v3.20.9) after excluding genes with ≤0.5 counts per million (CPM) reads sequenced in >25% of samples. Datasets were normalized by the trimmed mean of M-values (TMM) method (42), and differentially abundant genes (log fold change ≥ 1) were identified by the glmTreat test (74), requiring a Bonferroni-Holm corrected P < 0.05. For the iPEC data, this same process was used for each jackknifing iteration, which used 90% of samples in each group selected by random sampling without replacement. After 1000 jackknifing iterations, the one-sided, normal-based 95% confidence interval for gene-wise P values was calculated with statsmodels (v0.8.0) (75). The one-sided calculation was used because only transcripts with a P < 0.05 were of interest. Hierarchical clustering analysis was performed with squared Euclidean distance and average linkage.

AdaBoost. AdaBoost was performed in python with scikit-learn (v0.19.1, sklearn.ensemble.AdaBoostClassifier) (76). A 10% holdout subset of iPEC samples were excluded from all machine learning activities. The remaining samples available for training machine learning were filtered to remove genes with a CPM ≤0.5 in >25% of samples and TMM-normalized. The log(CPM) values of transcripts were then standardized (sklearn.preprocessing.StandardScaler) to mean 0 and SD 1 before fitting classifiers.

Optimal hyperparameter values were determined by random search over 1000 iterations with threefold stratified cross-validation and using Matthews correlation coefficient to quantify performance (fig. S6) (81). The number of estimators was sampled from a geometric distribution (scipy.stats.geom, P = 0.004, loc = 7), whereas the learning rate was sampled from an exponential distribution (scipy.stats.expon, loc = 0.08, scale = 2). Three iterations of the search showed the highest performance, and the median value for each hyperparameter was selected for further use (500 estimators and 1.6 learning rate).

AdaBoost models were trained with 10-fold stratified cross-validation to obtain robust estimates of PE classification capabilities. The full AdaBoost model training strategy is illustrated in fig. S7. This strategy followed a two-step approach. In the first step, five subsets of samples were created from the training data. Four subsets (80% of training data) were combined and used to fit AdaBoost, and one subset (20% of the training data) was used to assess performance during feature pruning. During feature pruning, the performance measure was Matthew’s correlation coefficient, and the model with the highest value was retained. In the case of a tie, the model with the fewest transcripts was retained. Because of the probabilistic nature of AdaBoost, model composition varied each time a model was fit. Therefore, to increase the likelihood that the most robust estimators were highly represented, fitting and feature pruning were repeated 10 times, generating 10 models for a subset. This process was repeated five times, in each round holding out one subset for pruning and combining the four others for fitting. This step ultimately produced 50 total models.

In the second step, the estimators from all 50 models were combined to generate a single aggregate AdaBoost model. This ensemble then underwent feature pruning, in which the importance measure for each transcript was the number of models in which it appeared, and the performance measure was log loss. The model with the best log-scaled average performance across all pruning subsets was selected as the final ensemble.

Within each fold, the validation samples were fully excluded from model training but were used to construct ROCs and determine the score threshold that maximized both sensitivity and specificity. The status of the iPEC holdout samples and the PEARL PEC independent cohorts were then classified with each of the 10 AdaBoost models from the cross-validation.

Statistical analysis

Unless otherwise noted, all statistical testing was two-sided with α = 0.05, and all point estimates and measures of uncertainty for population parameters are reported as means ± SD. The Shapiro-Wilk test was used to confirm whether data were normally distributed, if not, nonparametric statistical tests were used. P values were adjusted for multiple comparisons via Bonferroni-Holm or Tukey’s HSD post-hoc tests.


Materials and Methods

Fig. S1. C-RNA sample preparation workflow.

Fig. S2. Effect of plasma volume on C-RNA data quality.

Fig. S3. Integrity of C-RNA pregnancy signal after overnight storage in different BCTs.

Fig. S4. C-RNA transcripts can be altered during specific stages of pregnancy.

Fig. S5. Comparing pregnancy-associated transcript tissue specificity for three independent C-RNA studies.

Fig. S6. AdaBoost hyperparameter optimization.

Fig. S7. AdaBoost training strategy.

Fig. S8. AdaBoost output is affected by sample selection.

Table S1. PEARL HCC GA distribution.

Table S2. PEARL HCC pregnancy progression transcripts.

Table S3. iPEC diagnostic and inclusion/exclusion criteria for PE with severe features.

Table S4. Medications in use in iPEC.

Table S5. Medical center collection site patient distribution for iPEC.

Table S6. Study characteristics for PEARL PEC.

Table S7. TaqMan Probes for qPCR validation.

Table S8. Transcripts with C-RNA abundances altered in early-onset PE with severe features.

Table S9. Transcripts used by AdaBoost for classification of early-onset PE with severe features.

Data File S1. Raw whole-transcriptome sequencing counts for PEARL-HCC.

Data File S2. Raw whole-transcriptome sequencing counts for iPEC.

Data File S3. Raw whole-transcriptome sequencing counts for PEARL-PEC.

Data File S4. Primary data for constructing figures.


Acknowledgments: We thank S. Kyriazopoulou Panagiotopoulou and D. Kashef from the Illumina Deep Learning team for data analysis advice; L. McCoy from UTMB for assistance with enrolling patients and obtaining biospecimen and data; and V. Iyer and her team for specimen collection, processing, and data abstraction. Funding: All work was funded by Illumina Inc. Author contributions: S.M., G.G., and F.K. conceptualized the study. S.M., S.R., C.R.-H., G.G., L.L., M.R., S.Y., M.S., J.C., K.B., and K.O. curated data. S.R. and S.K. performed analysis and developed computational methods. S.M., C.R.-H., S.D., N.A., and C.T. performed experiments. S.M., S.R., C.R.-H., S.K., S.D., C.T., M.S., and K.O. developed experimental methods. G.G., L.L., M.R., S.Y., M.S., J.C., K.B., K.O., E.B., E.N., R.W., and G.S. oversaw patient sample collection. S.R. generated data visualizations. S.M. and S.R. wrote the original manuscript draft. S.M., S.R., C.R.-H., S.K., A.K., E.N., G.S., and F.K. revised the manuscript. Competing interests: S.M., S.R., C.R.-H., S.K., S.D., N.A., C.T., A.K., G.G., L.L., M.R., S.Y., M.S., J.C., K.B., K.O., and F.K. hold shares in and were paid employment (past or present) by Illumina Inc. E.N. is a consultant or advisory board member to Hologic, Natera, Seracare, and Illumina Inc. G.S. consults for AMAG Pharmaceuticals, Gestivision, Sera Prognostics, and Medicem. None of these consultant positions are related to the study presented here. One or more embodiments of one or more patents and patent applications filed by Illumina may encompass the methods, reagents, and data disclosed in this manuscript: PCT/US2019/033964, Circulating RNA Signatures Specific to Preeclampsia by S.M., S.R., C.R.-H., S.K., and F.K; US 62/939,324, Additional Circulating RNA Signatures Specific to Preeclampsia by S.M., F.K., S.K., S.R., and C.R.-H. Data and materials availability: All data associated with this study are present in the paper or the Supplementary Materials. Primary data used to generate figures are in data files S1 to S4. All sequencing data from the iPEC and PEARL HCC/PEC are available on NCBI dbGaP, accession phs002017.v1.p1.
View Abstract

Stay Connected to Science Translational Medicine

Navigate This Article