Fig. 1 | Journal of Translational Medicine

Fig. 1

From: Robust SNP-based prediction of rheumatoid arthritis through machine-learning-optimized polygenic risk score


Summary of the pipeline used to identify predictors of RA. The 978 RA case samples were split into a single training dataset (N = 599) and three test sets (N = 125/127/127). To preserve the case-to-control ratio in every dataset, the 2732 population control samples were split in the same proportions (61.2%/12.8%/13%/13%) into a single training dataset (N = 1673) and three test sets (N = 349/355/355). The case and control datasets were then merged on the SNPs common to both. The resulting training dataset was subjected to SNP filtering based on minor allele frequency, genotype missingness, or deviation from Hardy–Weinberg equilibrium. Missing genotypes were imputed first with Beagle 5.0 and then supplemented with machine-learning imputation using the Bayesian Ridge algorithm. The training set was further divided into eight subsets of varying sample sizes before recursive feature elimination with cross-validation (RFECV) was run using a Random Forest estimator. The features commonly selected by RFECV across the eight subsets were identified and then included stepwise, according to their feature importance scores, to find the minimum number of features required to achieve optimal performance metrics. This minimal set was then confirmed as the final optimal feature set by evaluating its predictive capacity across five diverse ML classifiers, using cross-validation and, separately, the three independent unseen test datasets. In addition, univariate logistic regression was used to estimate the effect sizes of the selected features for the calculation of the polygenic risk score (PRS). The PRS was likewise evaluated for its predictive capacity across the same five ML classifiers, using cross-validation and, separately, the three independent unseen test datasets.
Finally, a PRS-Risk calculator for RA was developed, which takes a patient's genotypes at the selected features as input and calculates the PRS and RA risk.
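
The PRS step described in the caption can be illustrated with a minimal sketch. It assumes the common additive form of a polygenic risk score, in which each selected SNP contributes its effect size (here taken to be a log odds ratio from a univariate logistic regression) multiplied by the number of effect alleles carried (0, 1, or 2). The SNP identifiers, the effect sizes, and the function name `polygenic_risk_score` are hypothetical placeholders, not values or code from the article, and the caption does not say how the calculator converts a PRS into an RA risk, so that step is not attempted here.

```python
# Hypothetical per-SNP effect sizes (log odds ratios); placeholder values only,
# not taken from the article.
EFFECT_SIZES = {
    "rs_example_1": 0.40,
    "rs_example_2": -0.25,
    "rs_example_3": 0.10,
}


def polygenic_risk_score(genotypes, effect_sizes):
    """Return the additive PRS: sum of effect size x effect-allele count.

    genotypes maps SNP id -> effect-allele count (0, 1, or 2).
    effect_sizes maps SNP id -> effect size (assumed log odds ratio).
    """
    missing = sorted(set(effect_sizes) - set(genotypes))
    if missing:
        raise ValueError(f"missing genotypes for: {', '.join(missing)}")

    score = 0.0
    for snp, beta in effect_sizes.items():
        dosage = genotypes[snp]
        if dosage not in (0, 1, 2):
            raise ValueError(f"invalid allele count for {snp}: {dosage!r}")
        score += beta * dosage
    return score


if __name__ == "__main__":
    # A hypothetical individual: two, one, and zero effect alleles respectively.
    person = {"rs_example_1": 2, "rs_example_2": 1, "rs_example_3": 0}
    print(polygenic_risk_score(person, EFFECT_SIZES))
```

For the hypothetical individual above, the score is 2 × 0.40 + 1 × (−0.25) + 0 × 0.10, that is, approximately 0.55. Rejecting missing or out-of-range genotypes, rather than silently treating them as zero, is a design choice of this sketch.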
