A novel strategy for rapid sepsis diagnosis based on cfDNA
Following the procedures shown in Fig. 1a, b, we developed a two-step approach for rapid sepsis diagnosis and validated it both by cross-validation and on an independent dataset. For the cross-validation, we first identified 3546 bacterial species through alignment and classification of cfDNA sequencing reads from 118 healthy and 38 sepsis samples. Additional file 1: Table S1 lists the corresponding t-test P-values, which measure the difference between sepsis and healthy samples from study 1 (No. PRJEB13247) and study 2 (No. EGAS00001001754), respectively. All samples were randomly partitioned into two groups: 2/3 (78 healthy samples and 25 sepsis samples) for training and 1/3 (40 healthy samples and 13 sepsis samples) for testing. For each species, we fitted a Beta distribution to its bacterial-abundance vector of 78 elements from the healthy training samples. The 25 abundances from the sepsis training samples were then tested one by one against this Beta distribution, generating 25 P-values. A species was considered a candidate pathogen if at least one of these P-values was below 0.01. This filtering procedure selected about 220 candidate pathogenic bacteria. Figure 2 shows examples of these candidate pathogens, whose bacterial abundances are distributed significantly differently in healthy and sepsis samples.
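The candidate-pathogen filter for a single species can be sketched as follows. This is a minimal illustration, not the implementation used in this study: it assumes a method-of-moments Beta fit, a one-sided upper-tail P-value, and a crude midpoint-rule integration that is adequate only for toy values like those shown. The function names and the example abundances are hypothetical.

```python
import math
from statistics import mean, variance


def fit_beta_moments(values):
    """Method-of-moments estimates (a, b) for a Beta distribution on (0, 1)."""
    m, v = mean(values), variance(values)
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("moments are not compatible with a Beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def beta_upper_tail(x, a, b, steps=20000):
    """Approximate P(X >= x) for X ~ Beta(a, b) by midpoint integration."""
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    width = (1.0 - x) / steps
    total = 0.0
    for i in range(steps):
        t = x + (i + 0.5) * width  # midpoint of the i-th sub-interval
        log_pdf = log_norm + (a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t)
        total += math.exp(log_pdf)
    return min(1.0, total * width)


def is_candidate(healthy, sepsis, alpha=0.01):
    """True if at least one sepsis abundance has a tail P-value below alpha."""
    a, b = fit_beta_moments(healthy)
    return any(beta_upper_tail(x, a, b) < alpha for x in sepsis)


# Hypothetical relative abundances of one species (assumed values).
healthy = [0.010, 0.012, 0.009, 0.011, 0.013, 0.008, 0.010, 0.012]
print(is_candidate(healthy, [0.011, 0.030]))  # True: 0.030 lies far in the upper tail
```

With a numerical library available, a maximum-likelihood fit and an exact survival function (for example `scipy.stats.beta.fit` and `scipy.stats.beta.sf`) would be the more usual choice.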
Second, using only the observed abundances of the candidate pathogenic bacteria, we trained a Random Forest with balanced subsampling to obtain a classifier. Finally, we applied this classifier to the remaining one-third of healthy and sepsis samples reserved for testing. The above pipeline was repeated 1000 times through bootstrap. As shown in Fig. 3a, the average out-of-bag (OOB) error was 0.16 once the number of decision trees was sufficiently large (> 100). The diagnosis strategy performed well, with an average AUC of 0.926, sensitivity of 0.91 and specificity of 0.83. For comparison, we also tried a logistic regression approach (average AUC of 0.77, sensitivity of 0.71 and specificity of 0.80) (Fig. 3b). The ranked list of the candidate bacterial species with respect to their importance in the Random Forest model is provided in Additional file 2: Table S2.
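The evaluation step, that is, the stratified 2/3 versus 1/3 partition and the AUC, sensitivity and specificity computed on the held-out samples, can be sketched as follows. This is an illustrative sketch under stated assumptions: the classifier itself is not reproduced, the scores and the 0.5 decision threshold are hypothetical, and all function names are made up for this example.

```python
import random


def stratified_split(labels, seed=0):
    """Split sample indices per class into 2/3 training and 1/3 testing."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = len(idx) * 2 // 3  # floor of two-thirds of the class size
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def auc(labels, scores):
    """AUC as the probability that a sepsis sample outscores a healthy one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called sepsis."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# 118 healthy (0) and 38 sepsis (1) samples, as in the cross-validation.
labels = [0] * 118 + [1] * 38
train, test = stratified_split(labels)
print(len(train), len(test))  # 103 53

# Hypothetical classifier scores for seven held-out samples (assumed values).
y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1]
print(auc(y_true, y_score))
print(sensitivity_specificity(y_true, y_score))
```

Flooring two-thirds of each class reproduces the group sizes reported above (78 and 25 for training, 40 and 13 for testing).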
For validation on an independent dataset, the 118 healthy and 38 sepsis samples from study 1 (No. PRJEB13247) and study 2 (No. EGAS00001001754), respectively, were used as the training set, and the samples from study 3 (No. PRJNA507824) were used as the independent validation set. The AUC shows that the proposed method also performs well on the independent dataset (Fig. 3c).
Bacterial co-occurrence networks based on cfDNA
Using the bacterial abundance matrices from 78 healthy and 25 sepsis samples for training, we constructed two bacterial co-occurrence networks (Fig. 4a). Each network contains 224 nodes, representing the 224 candidate pathogenic bacteria that were selected for having significantly different abundance distributions between healthy and sepsis samples. As mentioned above, blood can contain cfDNA fragments released by the bacteria inhabiting all human body sites. Thus, we expect the co-occurrence networks of healthy and sepsis samples to include some associations among “harmless” species that are generally not involved in sepsis. In order to focus on sepsis-specific associations, we generated a differential network by excluding from the sepsis co-occurrence network all association patterns also found in the healthy co-occurrence network (Fig. 4a). We found 19 clusters (Fig. 4b) of species in the differential network, which are the strongly connected components visible in Fig. 4a. In the 25 sepsis samples, all the species in a cluster are strongly correlated in terms of their abundance levels. The detailed cluster information is provided in Additional file 3: Table S3.
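The construction of the differential network and its clusters can be sketched as follows. This is a minimal sketch under assumptions not stated in the text: an association is taken to be a Pearson correlation of at least 0.9 between two abundance vectors, clusters are taken to be connected components of the undirected differential network, and the species names and abundances are invented toy values.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant vector carries no association
    return sxy / math.sqrt(sxx * syy)


def cooccurrence_edges(abundance, threshold=0.9):
    """Edges (u, v) between species whose abundances correlate >= threshold."""
    edges = set()
    for u, v in combinations(sorted(abundance), 2):
        if pearson(abundance[u], abundance[v]) >= threshold:
            edges.add((u, v))
    return edges


def connected_components(edges):
    """Clusters of the network, found by depth-first search."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Toy abundances: species -> values across four samples (assumed values).
healthy = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, 3, 2, 4],
           "D": [4, 1, 3, 2], "E": [2, 2, 3, 1]}
sepsis = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 1, 3, 2],
          "D": [8, 2, 6, 5], "E": [7, 1, 5, 3]}

# Keep only associations seen in sepsis samples but not in healthy samples.
differential = cooccurrence_edges(sepsis) - cooccurrence_edges(healthy)
print(connected_components(differential))  # [['C', 'D', 'E']]
```

In this toy example the A–B association appears in both networks and is therefore removed, leaving a single sepsis-specific cluster.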
In order to analyze the biological features of the clusters, we characterized the species in each one according to three aspects: respiration mode, metabolic habitat, and growth rate.
First, among all candidate pathogen species, 35.52%, 3.66%, and 52.12% are anaerobic, aerobic, and facultative, respectively (the remaining 8.7% are unknown). Most of the clusters show similarity in terms of respiration mode: 9 clusters exhibit a preference for facultative species (clusters 3, 5, 6, 10, 14, 15, 16, 17 and 19), and 7 clusters exhibit a preference for anaerobic species (clusters 1, 2, 7, 11, 12, 13 and 18). The few aerobic species in the sample do not dominate any cluster.
Second, before causing infection in blood, these bacteria usually originate in specialized metabolic environments. Bacterial metabolic habitats are divided into 4 types: host-associated, terrestrial, aquatic, and diverse. The species in clusters 3, 4, 5, 9, 14, 15, 17, 18, and 19 are mainly host-associated, the species in cluster 10 are mainly terrestrial, the species in cluster 3 are mainly aquatic, and clusters 1, 6, 7, 10, 12, 13, 16 contain species from diverse metabolic environments.
Third, bacterial growth is significantly correlated with metabolic variability and the level of co-habitation. Doubling-time data have led to the important finding that variations in the expression levels of genes involved in translation and transcription influence growth rate [34, 35]. We partitioned the clusters into two groups according to the doubling time of their member species: “fast” and “slow” growing clusters are those whose median doubling time is, respectively, shorter or longer than the mean over all species by at least one standard deviation [36]. The median doubling time for the species in clusters 6, 7, 11 and 13 is larger than 1 (fast growing clusters), while that for the members of clusters 1, 3, 4, 5, 15 and 16 is smaller than 0.6 (slow growing clusters). Note that fast growth rates are typical of species that exhibit ecological diversity, so the identification of “fast” clusters accords with the metabolic habitats analyzed in the previous paragraph.
For the pathogens of each cluster, a specific antibiotic therapy could be provided [37]. A list of possible antibiotics that might be used for each cluster is shown in Additional file 3: Table S3.
Inferring missing bacteria from identified species
A given patient with sepsis can carry multiple pathogens [38]. Therefore, knowledge of all bacteria present is crucial if we are to provide fast and effective antibiotic treatment. At the same time, the pathogenic species span a wide range of growth strategies and environmental requirements (such as aerobic or anaerobic, acidity, etc.), which makes it difficult to detect all species in a single culture. Moreover, due to the limited volume of a blood sample, not all pathogenic species can be identified from cfDNA. In short, unobserved bacterial species are a major obstacle to effective treatment.
Based on the bacterial co-occurrence network, it is possible to infer missing bacterial species from the identified species. Specifically, having constructed a bacterial co-occurrence network, we know that some species usually have consistent abundance levels in sepsis samples. Thus, when some species from a cluster are identified in a sepsis sample, statistically it is highly probable that all members of the cluster are present. We can infer the presence of “missing” bacteria in this way, if the missing bacteria belong to a cluster.
To test the effectiveness and robustness of this bacteria-inferring scheme, we randomly removed a certain percentage of species from the pool of identified species for each sample, in both the cross-validation and the independent-dataset validation. We then tried to infer the presence of the missing bacteria from the remaining species, based on the bacterial co-occurrence network. Figure 5a, c show that the recovery rate is about 50–60% and decreases gradually as the removal rate increases. The overall results, shown in Fig. 5b, d, are satisfactory: the total proportion of species recovered (including those not randomly removed) is still 60%, even when 80% of the observed species were randomly removed. These results demonstrate the effectiveness of a bacterial co-occurrence network for inferring the presence of unobserved bacteria from identified species. This method has great potential, especially in cfDNA-based analysis, because a 10 ml blood sample contains a very limited amount of cfDNA, and only a small proportion of that is microbial cfDNA.
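The inference rule and the removal experiment can be sketched as follows. This is an illustrative sketch, not the procedure used to produce Fig. 5: it assumes that a single observed member is enough to infer the whole cluster (the hypothetical `min_hits` parameter), and the clusters, species names, and removal rate are invented for the example.

```python
import random


def infer_species(observed, clusters, min_hits=1):
    """Add every member of a cluster that has >= min_hits observed members."""
    inferred = set(observed)
    for members in clusters:
        if len(set(members) & set(observed)) >= min_hits:
            inferred |= set(members)
    return inferred


def recovery_rates(true_species, clusters, removal_rate, rng):
    """Remove a fraction of species at random, then try to infer them back.

    Returns (fraction of removed species recovered,
             fraction of all true species present after inference).
    """
    species = sorted(true_species)
    n_remove = int(round(removal_rate * len(species)))
    removed = set(rng.sample(species, n_remove))
    observed = set(true_species) - removed
    inferred = infer_species(observed, clusters)
    removed_rate = len(removed & inferred) / len(removed) if removed else 1.0
    total_rate = len(set(true_species) & inferred) / len(species)
    return removed_rate, total_rate


# Hypothetical clusters and one sample's true species content (assumed values).
clusters = [{"A", "B", "C"}, {"D", "E"}]
sample = {"A", "B", "C", "D", "E", "F"}

print(sorted(infer_species({"A", "F"}, clusters)))  # ['A', 'B', 'C', 'F']

rng = random.Random(0)
removed_rate, total_rate = recovery_rates(sample, clusters, 0.5, rng)
print(removed_rate, total_rate)
```

Species that belong to no cluster, such as "F" here, can never be recovered once removed, which is one reason the recovery rate stays below 100%.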