Identification of blood-based key biomarker and immune infiltration in Immunoglobulin A nephropathy by comprehensive bioinformatics analysis and a cohort validation

Background To identify the critical genes in the onset and progression of Immunoglobulin A nephropathy (IgAN) and to explore its immune cell infiltration feature. Methods Differentially expressed genes (DEGs) were firstly screened from 1 blood-derived dataset GSE73953 and a glomerulus derived dataset GSE93798 through limma analysis, overlap genes omitting and weighted gene correlation network analysis (WGCNA) and further reduced according to expression pattern and correlation with the clinical features: eGFR and proteinuria, followed by external validation using the GSE37460 dataset and an IgAN cohort. In addition, the CIBERSORT tool for immune cell infiltration analysis, ceRNA network construction and Connectivity Map (CMAP) were also performed. Results A total of 195 DEGs were found, and among them, 3 upregulated (ORMDL2, NRP1, and COL4A1) and 3 downregulated genes (ST13, HSPA8 and PKP4) are verified to correlate clinically, and finally ORMDL2, NRP1 and COL4A1 were validated in patient cohort and with the ability of IgAN discrimination (highest AUC was COL4A1: 97.14%). The immune cell infiltration results revealed that significant differences could be found on resting memory CD4 T cells, activated NK cells, and M2 macrophages between control and IgAN. Conclusions Our results demonstrated here that significantly upregulated DEGs: ORMDL2, NRP1 and COL4A1, could be served as the diagnostic marker for IgAN, and dysregulated immune cell infiltration hinted possible the immune system intervention point in the setting of IgAN. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-022-03330-w.


Background
Immunoglobulin A nephropathy (IgAN), characterized by glomerular IgA deposits on renal biopsy, is one of the most common primary glomerular disease and remains a leading cause of chronic kidney disease and end-stage kidney disease (ESKD) [1][2][3]. The common clinical manifestations include hematuria, fever and different degree of proteinuria [4,5]. The estimated incidence of IgAN is about 2-10 per 100,000 person-years [6,7] and a higher disease burden has been reported in the East

Open Access
Journal of Translational Medicine Asian countries, including China [2]. To realize quick diagnosis using a less invasive methodology, blood-based biomarkers rather than renal biopsy that could be more preference in early indication and urgently needed in the clinical practice.
Recent advances in the microarray and transcriptomic technology allow us to obtain a landscape view of the differentially expressed genes (DEGs) in a certain kind of disease [8][9][10]. Several studies have been performed to assess the expression profiling of the samples from IgAN. For example, Nagasawa et al. [11] performed the microarray using 15 IgAN patients derived peripheral blood mononuclear cells (PBMCs), Liu et al. [12] employed microarray analysis of the glomerular compartment of renal biopsy specimens from 19 IgAN patients and Guo et al. [13] used microarrays to analyze the transcriptome of microdissected renal biopsies from 27 patients with IgAN. However, certain bias could be presented in these isolated experiments due to the complicated human genetic background and different experimental conditions. Integrative bioinformatics analysis can provide the possibility of multiple group verification of the DEGs, thereby facilitating the process of novel and reliable biomarkers identification.
In the present study, we aimed to identify the critical genes in the onset and progression of IgAN via integrative employment of blood-based microarray data and glomerular tissue-related data, thereby identifying the novel biomarkers for IgAN diagnosis. Moreover, we also explored immune cell infiltration feature using the glomerular-related data to discover the possible treatment intervention for IgAN.

Patients recruitment and clinical data collection
A total of 7 IgAN were recruited from the Department of Nephrology, the First Affiliated Hospital of Soochow University for peripheral blood mononuclear cells (PBMCs) collection from June 2021 to September 2021.
The diagnosis of IgAN were based on The Oxford Classification for IgAN and our previous publication [14,15], and all the included IgAN patients were primarily confirmed with histological features of deposition of IgA in the mesangial area by immunofluorescence. Patients who were presented with secondary IgAN, such as allergic purpura, chronic hepatitis B, were excluded. The clinical data include gender, serum creatinine, estimated glomerular filtration rate (eGFR), and 24 h proteinuria was collected (Table 1). Moreover, PBMCs were also obtained from 5 healthy volunteers as controls. The current study protocol was approved by the Ethical Committee of the First Affiliated Hospital of Soochow University and adhered to the requirement of the Declaration of Helsinki. All the subjects were required to provide written informed consent.

Microarray data and related clinical data acquisition
The mRNA expression and related clinical data about IgAN were downloaded from Gene Expression Omnibus (GEO) (http:// www. ncbi. nlm. nih. gov/ geo/) using the search terms "Immunoglobulin A nephropathy", "IgA nephropathy", "IgAN" and "expression profiling by array". The gene expression microarray datasets GSE73953, GSE93798 and GSE37460 were selected and downloaded. The criteria for dataset selection were as follows: human clinical samples with detailed clinical and gene expression information. Among these datasets, GSE73953 and GSE93798 were used for differentially expressed genes (DEGs) screening, and patient eGFR and proteinuria data downloaded from The Nephroseq v5 analysis engine (https:// v5. nephr oseq. org) was used for correlated gene screening. GSE37460 and above cohort containing IgAN and control subjects were used for expression-level validation of the 6 identified eGFR and proteinuria correlated genes at different tissue composition levels. Detailed information on these microarray datasets is listed in Additional file 11: Table S1.

Identification of the candidate genes via integrated bioinformatic analysis
The entire process of the analysis was shown in Fig. 1 and the detailed screening of the DEGs contains 3 steps as follows:

Data processing
The "limma" package in R [16] (http:// www. bioco nduct or. org/) was used for background correction, normalization and differential expression analysis using IgAN and control samples from GSE73953 and GSE93798. Initial screening of the DEG was performed using the p < 0.05 as the criteria, and was illustrated as Venn maps via an online tool (http:// bioin forma tics. psb. ugent. be/ webto ols/ Venn/) ( Fig. 2A). Then the intersected DEGs from the 2 datasets were employed in the following analysis ( Fig. 2B).

Construction of weighted gene co-expression network analysis (WGCNA)
The "WGCNA" package in R [17] was employed for the biologically meaningful module information mining based on pairwise correlations between genes from highthroughput data. Co-expression networks were built using the intersected DEGs and corresponding clinical information from above step to illustrate strong and weak correlations between genes. To fit for the scale-free network, the square of the between genes correlation coefficient was calculated to pick up the optimal soft threshold 5 (Fig. 2C, left), followed by the mean connectivity calculation of the corresponding soft threshold (Fig. 2C, right). Then, the adjacency and topological overlap matrix (TOM) similarity matrices were generated based on the selected soft threshold for modules detection (generating dendrogram), followed by the module assignment under the dendrogram (Fig. 2D) and correlation of the modules to the clinical traits to identify the important DEGs (Fig. 2E).

Clinical correlated genes identification and functional enrichment analysis
The Nephroseq v5 analysis engine (https:// v5. nephr oseq. org) provides the gene expression features and clinical characteristics. Pearson correlation analysis was performed using the DEGs from 2.3.2 to identify the genes that correlated with eGFR and proteinuria. Finally, protective genes with a definition of eGFR negatively correlated and proteinuria positively correlated were selected. Gene Ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed with the "clusterProfiler" package in R [18]. Disease Ontology (DO) enrichment analyses were performed on DEGs using the "clusterProfiler" and "DOSE" packages in R [19].

Immune cells infiltration analysis
To quantify the relative proportions of infiltrating immune cells based on the gene expression matrices from above IgAN datasets (GSE93798 and GSE37460), a bioinformatics algorithm named CIBERSORT (https:// ciber sortx. stanf ord. edu/) [20] was employed for immune cell infiltrations calculation. The putative abundance of immune cells was estimated using a reference set with 22 types of immune cell subtypes (LM22) with 1,000 permutations. Violin plots were drawn using the "ggplot2" package in R to visualize the differences in immune cell infiltration between the IgAN and control samples. Correlation analysis and visualization of 22 types of infiltrating immune cells were performed using the "corrplot" package in R. Moreover, the relationships between the clinical correlated DEGs and 22 types of immune cells were also visualized.

ceRNA network construction and connectivity map (CMAP) analysis
The interaction of DEGs identified from 2.3.3 with the possible miRNA were predicted using TargetScan (http:// www. targe tscan. org) database, and the further interaction between miRNA and circRNA or lncRNA were constructed using StarBase (http:// starb ase. sysu. edu. cn) database. In order to simplify the interaction links, CircRNAs or lncRNAs candidates with the weak interaction with miRNAs were removed using the ClipExp-Num software from StarBase. Moreover, Connectivity Map (https:// clue. io/), an online database that relates disease, genes, and drugs based on similar or opposite gene expression signatures, was used for potential drug prediction. Briefly, the DEGs were compared with the reference data from the database to obtain a correlation score (− 100 to 100), and the small molecule compounds with a mean coefficient less than − 98 were ranked and selected.

Real-time reverse transcription PCR to verification of the clinical related genes
PBMCs were obtained from the IgAN and healthy controls via density gradient centrifugation using the Ficoll reagents from Solarbio (Wuhan, China). RNA extraction was employed the RNAeasy reagent (Vazyme, Nanjing, China) whereas reverse transcription was performed using the HiScript III 1st Strand cDNA Synthesis Kit (Nanjing, China) according to manufacturer's instruction, and real-time PCR was carried out using SYBR ® Green Master Mix (Vazyme, Nanjing, China) on the ABI Ste-pOne PlusTM real-time PCR system. The Relative mRNA expression was calculated using the 2 −ΔΔCt method in a triplicated manner. Primers were as described in Table 2.

Statistical analysis
Statistical analysis was performed using R version 4.1.0. Quantitative data are expressed as the mean ± standard deviation (SD), and qualitative data are expressed as numbers and percentages. Unpaired t-tests and Mann-Whitney U tests were respectively used for parametric and nonparametric 2-group comparisons of quantitative data. Correlation analysis between the clinical parameters and gene expression level was performed using Pearson correlation. The diagnostic efficacy was calculated by receiver operator characteristic (ROC) curves and area under the curve (AUC) using the "sur-vivalROC" and "ROCR" packages in R. Unless specifically mentioned, p < 0.05 was considered statistically significant.

Identification of the candidate genes using GEO datasets and gene enrichment analysis
To find critical genes involved in the IgAN development and progression, we employed the 1 GEO dataset using PBMCs samples (GSE73953) and 1 GEO dataset using the glomerular part of the human kidney (GSE93798) as the training sets to perform the DEGs analysis. After application of the criteria of p < 0.05, a total of 4583 and 9072 genes were respectively found in GSE73953 and GSE93798 ( Fig. 2A, B; Additional file 1) and the intersection of the genes are 1766 (Additional file 1). To further decrease the gene number, WGCNA was applied to confirm the IgAN phenotype-related genes ( Fig. 2C-E) and a total of 426 genes were found (Additional file 2). After confirmation of the consistent expression pattern in 2 datasets, a total of 86 up-and 109 down-regulated genes were found ( Fig. 3A; Additional file 3). Then, we performed the GO enrichment, KEGG pathway and disease ontology analysis (Fig. 3B-D), the results showed that response to unfolded protein, response to topologically incorrect protein and homeostasis of the cell number are the top 3 GO terms, MAPK signaling pathway, transcriptional misregulation in cancer and Estrogen signaling pathway are the top 3 KEGG pathways. For disease ontology terms, peripheral nervous system neoplasm, autonomic nervous system neoplasm and neuroblastoma ranked top 3 terms, and kidney disease and urinary system disease ranked 5 and 6 among the top 10 terms.

Identification of the genes of clinical significance
Since the Nephroseq v5 analysis engine allows to explore the relationship between gene expression levels and clinical characteristics, we further performed the correlation analysis using the gene expression characteristics and 2 most critical clinical characteristics in IgAN, eGFR and proteinuria (Additional files 4, 5).
Using the protective genes defined in the methods part, we totally found 3 upregulated (ORMDL2, NRP1 and COL4A1; Fig. 4A, left) and 3 downregulated protective genes (ST13, HSPA8 and PKP4; Fig. 4A, right). Moreover, we also provided the correlation maps of these 6 genes, and a relatively high correlation parameter R-value (≥ 0.45) and significant p-value (all < 0.05) were found (Fig. 4B). Via above 3-step of integrated bioinformatics analysis and combination of the clinical characteristics, a total of 6 genes were found.

Validation of the expression pattern and evaluation of diagnostic efficacy of above identified genes
To further validate the clinical significance of 6 genes, we employed 1 GEO dataset (GEO37460) and a small cohort of patients from our hospital to verify. We firstly confirmed our results in 2 training sets, GSE73953 and GSE93798 (Fig. 5A, B), and the results showed that consistent up-and downregulated expression pattern and high AUC values of all the 6 genes according to the ROC curves. Further validation using the dataset of GSE37460 revealed that significantly upregulated genes were ORMDL2, NRP1 and COL4A1, and significantly downregulated genes were ST13 and PKP4. Therefore, HSPA8 is omitted in the following cohort analysis due to the absence of differences. Moreover, according to the results from ROC curves, highest AUC could be observed using the gene NRP1 and COL4A1 (both are 99.25%; Fig. 5C). In our cohort, DEGs with consistent pattern were only confirmed on 3 upregulated genes, ORMDL2, NRP1 and COL4A1 (Fig. 6A-C), but not on PKP4 and ST13 ( Fig. 6D-E). Moreover, according to the ROC curves, highest AUC could be observed using the gene COL4A1 (97.14%) (Fig. 6F). These results suggested that significantly upregulated genes ORMDL2, NRP1 and COL4A1 could be served as the gene marker to differentiate IgAN and controls.

ceRNA network construction and CMAP results
Since non-coding RNAs (including circRNAs, lncRNA and miRNA) are considered to play an important role in the regulation of the function of mRNA. To better understand the regulatory network of mRNAs found in the above analysis, we constructed 2 ceRNA networks. The results about lncRNA-miRNA-mRNA network and mRNA-miRNA-circ-RNA network are shown in Fig. 7A, B (Additional file 6). Strong interactions could be found in the genes COL4A1, NRP1 and ORMDL2 with 6 miR-NAs, 16 lncRNAs and 13 cirRNAs. These data provided a preliminary view of the regulatory network of our identified mRNAs.
Moreover, according to the expression profiling the DEGs from 2.3.3, a total of 17 small-molecule compounds were identified as potential therapeutic agents from CMAP analysis (Table 3; Additional file 7).

Results of immune cell infiltration
The autoimmune disease property of IgAN made it necessary to explore the possible dysregulation of immune cells, whereas the dataset GSE93798 providing the expression profiling data from glomerulus made it possible to fulfill this goal. According to the immune cell infiltration results (Fig. 8A; Additional file 8), significant difference could be found on naive B cells (p = 0.016), resting memory CD4 T cells (p < 0.001), resting and activated NK cells (p = 0.021 and < 0.001), M1 and M2 macrophages (p = 0.016 and 0.020), activated mast cells (p = 0.038), and neutrophils (p < 0.001) between control and IgAN.
Moreover, the correlation of 22 types of immune cells were also calculated, the results revealed that naive B In addition, we used the external dataset GSE37460 to validate the above identified 8 different infiltrated immune cells between IgAN and normal controls. The violin plot of the immune cell infiltration showed that, compared with the normal control, resting memory CD4 T cells (p < 0.001), activated NK cells (p < 0.001), and M2 macrophages (p < 0.001) were confirmed as difference (Fig. 8C).

Immune cell infiltration correlation analysis of 3 validated genes
Furthermore, we also explored the correlations between 3 validated genes and different immune cell types, the

Discussion
With the biomedical research entering into the big data era, steadily increased number of algorithms and analysis tools are proposed by different research groups each year; moreover, multiple types of omics data about a specific type of disease are also repeated assessed by different groups. To obtain full insights into the systemic and pathological effects of omics data, an integrative bioinformatics analysis of these datasets by using the effective algorithms or tools have become the top priority, which can facilitate the further exploration by "Wet" experiments.
In present study, we employed limma analysis, overlap genes omitting and WGCNA to obtain DEGs from 1 blood-derived dataset and a glomeruli-derived dataset, and a total of 195 DEGs were found. Then we found 3 upregulated and 3 downregulated DEGs via checking of expression pattern and clinical features correlation. Finally, through the external dataset and an IgAN cohort validation, 3 genes including ORMDL2, NRP1, and COL4A1 were confirmed with the ability of IgAN discrimination, and the highest AUC was found by COL4A1, which is 97.14%.
During the datasets screening process, we also found another 2 datasets (GSE35488 and GSE14795) related with IgAN in the GEO website. In GSE35488 [21], RNA from tubulointerstitial compartments was extracted and processed for microarray analysis and the authors used these tissues for screening for proteinuria related DEGs. Finally, they identified an albumin-regulated 11-gene signature, which is shared in all forms of glomerulonephritits, including IgAN and membranoproliferative glomerulonephritis. Due to the poor gene clustering (Additional file 10) and inaccurate tissue derivation, this dataset was omitted. For GSE14795 [22], the whole blood cells from 12 IgAN patients were used for analysis using a specific commercial system, and DEGs were identified using SAM method, however, the DEGs obtained by us using the same SAM method showed significant difference from those reported in the papers, and we guess that the data from this GEO dataset is incomplete.
In the DEGs discovery process, we found 5 genes are of clinical significance. Among them, ORMDL2 is considered as a negative regulator of sphingolipid synthesis and is predicted to be located in the endoplasmic reticulum. Clarke et al. [23] reported that ORMDLs are able to restrain sphingolipid metabolism, thereby limiting levels of dangerous metabolic intermediates that can interfere with essential physiological processes such as myelination, whereas Bugajev et al. [24] demonstrated that ORMDL2 deficiency could affect mast cell signaling during the systemic anaphylactic reaction. Due to an abnormal sphingolipid metabolism could contribute to various pathologic processes, including kidney diseases [25], and activated mast cells were revealed according to the immune infiltration results, it is highly possible that ORMDL2 may an important role in IgAN. NRP1, a transmembrane receptor protein, has been reported to play an important role in several kidney diseases, including regulation of apoptosis and cytoskeleton organization of cells located in glomerular during the process of diabetic nephropathy [26], and prediction of renal outcome in patients with lupus nephritis [27]. COL4A1 encodes the α1 chain of collagen IV, which a major component of basement membranes and its mutation could result in glomerulocystic kidney disease [28][29][30]. According to previous reports [31,32], macrophage can directly contribute to collagen generation in wound healing and other pathological conditions. They suggested that primary function of type VI collagen secreted by macrophages could be cell-cell (immune cells) and cell-matrix interactions modulation and production of type VI collagen is a marker for a nondestructive, matrix-conserving macrophage phenotype that could profoundly influence physiological and pathophysiological conditions in vivo. PKP4 is considered to play a role as a regulator of Rho activity during cytokinesis and is confirmed as one of DEGs in glomerular diseases reported by Ding et al. [33]. Most recently, Raby found that PKP4 is presented in the urine exosome in patients with Autosomal Dominant Polycystic Kidney Disease (ADPKD) as poor treatment responder biomarker [34]. In our cohort validation, we found that PKP4 is also upregulated in IgAN compared to normal controls, and it is omitted due to inconsistent expression pattern between the datasets and our patient cohort. HSPA8 is a chaperone protein with important roles in various cellular processes, including autophagy [35] and protection from reactive oxygen species (ROS) production by mitochondria during inflammatory conditions [36]. Kim et al. [37] reported that HSPA8 could be used as an effective and sensitive non-invasive biomarker for the assessment of nephrotoxicity, whereas Wen et al. [38] suggested that possible the role of HSPA8 in organ fibrosis in diabetic kidney disease (DKD) patients. In a recent study by Lin et al. [39], downregulated HSPA8 was found in IgAN patients, however, we did not find an obvious difference on HSPA8 in our discovery datasets. Therefore, further confirmation using more subjects might be carried out. No report was found on the role of ST13 in the kidney disease, whereas most of studies about ST13 are about the cancer biology [40]. Furthermore, ST13 was found with no significant downregulation in our validation cohort. We attributed the reasons that limited patient number or possible the error in the microarray analysis.
For the immune infiltration results, we found that differentially ratio of naive B cells (p = 0.016), resting memory CD4 T cells (p < 0.001), resting and activated NK cells (p = 0.021 and < 0.001), M1 and M2 macrophages (p = 0.016 and 0.020), activated mast cells (p = 0.038), and neutrophils (p < 0.001) between control and IgAN. B cells may be involved in the production of galactose-deficient IgA1(Gd-IgA1) and its antibodies in IgA nephropathy [41], and decreased number of naive B cell could be resulted by more of activated B cells. Moreover, Eijgenraam et al. [42] showed previously that dendritic cells of IgAN patients have an impaired capacity to induce IgA production in naive B cells. As heterogeneous cells of the innate immune system, macrophages can fluidly modulate their phenotype in response to the local microenvironment, and macrophage polarization is found during chronic kidney disease (CKD) [43], therefore, it is reasonable to observe elevation of M1 and M2 macrophages in IgAN. Moreover, activated mast cells mediated antibody production [44], neutrophils caused glomerular injury [45], NK induced hematuria [46] and T cell-induced Gd-IgA1 synthesis elevation [47] are reported according to previous reports during the IgAN. In addition, according to the results from ceRNA network, miR-135a-5p and miR-135b-5p are found significantly correlated with the gene ORMDL2 found here, in the recent studies by Pawluczyk et al. [48] and Min et al.
[49], miR-135a-5p showed differentially expressed in IgAN patients, whereas miR-135b-5p was confirmed as the differentially expressed miRNA using the urinary exosome samples from IgAN patients. Moreover, circulating There are also some limitations in present study. First, the number of validation cohort is limited and no followup information is available, thereby resulting in possible the bias on the results and inability of examination of the genes in the prognosis. Second, although several immune cells were found dysregulated and the results were consistent with previous reports, the detailed function of these immune cells and the exact molecular events during the IgAN remains unknown. Therefore, further clinical validation with a large cohort of patient and experimental verification of these genes are required in near future.

Conclusions
In summary, we demonstrated here that significantly upregulated DEGs: ORMDL2, NRP1 and COL4A1 could be served as the diagnostic marker for IgAN and dysregulated immune cell infiltration hinted possible the immune system intervention points in the setting of IgAN.