### Subjects

To rate an individual’s data with respect to a control population, a large set of standardized data from a representative population sample is required. The Human Connectome Project (HCP) provides such a resource [17,18,19]. For this study, data from the HCP’s 1200 Subject Release were used. In total, structural data (T1- and T2-weighted sequences) from 1113 subjects (507 males, aged 22–40 years) were available at the time of this study. Of these, 1000 were randomly selected for generating null distributions of cortical thickness; the remaining 113 were reserved for subsequent validation (see below).

### Data acquisition and preprocessing

The HCP data were acquired on a 3 Tesla Connectome Scanner. Two structural sequences were acquired: a T1-weighted MPRAGE (repetition time (TR) = 2400 ms, echo time (TE) = 2.14 ms, inversion time = 1000 ms, flip angle (FA) = 8°, field of view (FOV) = 224 × 224 mm, voxel resolution (VR) = 0.7 mm isotropic, bandwidth (BW) = 210 Hz/Px, iPAT factor 2, total acquisition time 7 min 40 s) and a T2-weighted SPACE (TR = 3200 ms, TE = 565 ms, variable FA, FOV = 224 × 224 mm, VR = 0.7 mm isotropic, BW = 744 Hz/Px, iPAT factor 2, total acquisition time 8 min 24 s). The full imaging protocols can be found online at http://protocols.humanconnectome.org/HCP/3T/imaging-protocols.html. All study procedures of the HCP protocol were approved by the Institutional Review Board at Washington University in St. Louis.

The HCP offers data preprocessed with standardized and validated procedures. The main preprocessing steps encompassed gradient distortion correction, brain extraction, nonlinear registration, surface registration, and registration onto high-resolution (164 k mesh) and low-resolution (32 k mesh) templates; more details on the exact preprocessing pipeline can be found in [9, 20,21,22]. The mesh images are provided in CIFTI format (Connectivity Informatics Technology Initiative), a file format which combines surface-based cortical data with volume-based subcortical/cerebellar data and has been found to enhance both alignment to the geometry of the cortex and statistical power [23]. The HCP’s minimally preprocessed data include cortical thickness maps, generated with the standardized FreeSurfer pipeline using combined T1-/T2-reconstruction [7, 8]. For this study, the high-resolution (164 k mesh) cortical thickness maps were used.

### Statistical analysis

Statistical analysis of the minimally preprocessed HCP neuroimaging data was carried out with tools from the Connectome Workbench [18, 19] and MATLAB R2019b (The MathWorks, Natick, USA). First, null distributions were generated using different strategies; subsequently, these methods were validated and compared based on their specificity and sensitivity.

#### Generating null distributions

Different strategies to generate null distributions were compared. These can be subdivided into (a) generating one common null distribution for all data points on the cortex (referred to as “vertices” in CIFTI mesh files) and (b) generating separate null distributions for distinct brain regions (Fig. 1a, b). Note that cortical thickness varies nonuniformly across the human cortex [24,25,26,27], such that different brain regions show different population means (Fig. 1c). Separate null distributions for distinct brain regions might therefore increase the sensitivity for detecting atrophy, which is why both approaches were compared in the present study. The two approaches were further subdivided into more and less conservative statistical corrections, such that in total four methods were compared. For all methods, null distributions were computed using nonparametric permutation procedures [28], since these make fewer assumptions than parametric models and are therefore considered more robust [29, 30].

##### Method 1: Z-min statistic per data point

The most statistically conservative approach was based on generating one common reference distribution for all 298,261 data points of the cortical surface. Each of the 1000 HCP data sets was selected iteratively (“test data set”) and standardized with respect to the remaining 999 data sets (“control data sets”). For that, z-scores were calculated for each vertex using the formula z_{vertex} = (d_{vertex} – μ_{vertex})/σ_{vertex}, where d_{vertex} is the cortical thickness value of one vertex from the test data set, μ_{vertex} the mean value of that vertex across the control data sets and σ_{vertex} the respective standard deviation. From the resulting z-score map, only the minimum value was saved (note that the present research question specifically addresses cortical *thinning*). The result was a reference distribution consisting of 1000 z-scores. Using this distribution, each vertex of an independent validation data set can be rated separately with respect to the reference population by z-transforming it with the above formula (see section “Validation”).
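As a sketch, the leave-one-out permutation of method 1 could look as follows (the study used MATLAB and Connectome Workbench tools; the array sizes and random data here are purely illustrative stand-ins for the 1000 HCP maps with 298,261 vertices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stand-in for the HCP cortical thickness maps
# (the study used 1000 subjects and 298,261 cortical vertices)
n_subjects, n_vertices = 200, 2000
thickness = rng.normal(2.5, 0.3, size=(n_subjects, n_vertices))

# Leave-one-out permutation: standardize each "test" map against the
# remaining "control" maps and keep only the minimum z-score
null_min_z = np.empty(n_subjects)
for i in range(n_subjects):
    controls = np.delete(thickness, i, axis=0)
    mu = controls.mean(axis=0)            # mu_vertex
    sigma = controls.std(axis=0, ddof=1)  # sigma_vertex
    z = (thickness[i] - mu) / sigma       # z-score per vertex
    null_min_z[i] = z.min()               # most extreme (thinnest) value

# null_min_z is the reference distribution of z-min statistics
```

Keeping only the minimum implements the one-sided focus on cortical thinning described above.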

##### Method 2: Z-min statistic per data point, averaged across brain regions

In method 1, the null distribution was calculated from the most extreme values across the entire cortex. However, given that cortical thickness is physiologically nonuniform across the cortex [27], potential atrophy will be hard to detect in physiologically thicker brain regions. Method 2 aimed to increase the biological plausibility of method 1: while the same null distribution was used, data points were now summarized across anatomically distinct brain regions, defined by the Desikan–Killiany atlas [31]. This atlas subdivides the cortical surface into 68 regions (“labels”, 34 per hemisphere) based on morphologic features. For subsequent validation, statistical significance was determined for the synopsis of all vertices within each of the 68 regions, instead of for each vertex separately (see section “Validation”).

##### Method 3: Z-min statistic per brain region

Despite the increased biological plausibility of method 2, that procedure was still based on one common null distribution derived from the most extreme values across the cortex. Method 3 corrected this by calculating distinct null distributions for each of the 68 Desikan–Killiany labels. For that, the permutation procedure described in method 1 was repeated, but z-maps were now calculated using the formula z_{vertex} = (d_{vertex} – μ_{Label})/σ_{Label}, where z_{vertex} is the z-score for a vertex of the test data set, d_{vertex} the observed cortical thickness value for that vertex, μ_{Label} the mean value of the respective *label* across the control data sets and σ_{Label} the respective standard deviation. On each iteration, the minimum z-score of all vertices composing one common label was saved, such that the result was a 68 × 1000 matrix, providing a null distribution for each label (Fig. 1d). With these null distributions, each brain region can be rated separately with respect to the reference population by converting the cortical thickness data into z-scores using the formula z_{Label} = (d_{Label} – μ_{Label})/σ_{Label} (Fig. 1e).
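A minimal sketch of the per-label permutation, under the same illustrative assumptions as before (a random vertex-to-label assignment stands in for the Desikan–Killiany parcellation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes; the study used 1000 subjects and the 68-label
# Desikan-Killiany parcellation of 298,261 vertices
n_subjects, n_vertices, n_labels = 200, 2000, 68
thickness = rng.normal(2.5, 0.3, size=(n_subjects, n_vertices))
labels = rng.integers(0, n_labels, size=n_vertices)  # vertex -> label map

null_min_z = np.empty((n_labels, n_subjects))  # one null row per label
for i in range(n_subjects):
    controls = np.delete(thickness, i, axis=0)
    for lab in range(n_labels):
        idx = labels == lab
        mu = controls[:, idx].mean()          # mu_Label
        sigma = controls[:, idx].std(ddof=1)  # sigma_Label
        z = (thickness[i, idx] - mu) / sigma  # z per vertex of this label
        null_min_z[lab, i] = z.min()          # minimum z within the label
```

The resulting matrix corresponds to the 68 × 1000 matrix of null distributions described above.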

##### Method 4: Z-score per brain region

Finally, in method 4, null distributions were generated by *averaging across all vertices* of each brain region instead of using each label’s most extreme value, as in method 3. On each iteration, the mean thickness d_{Label} was calculated for each brain region of the test data set and converted into a z-score using the formula z_{Label} = (d_{Label} – μ_{Label})/σ_{Label}, yielding the null distributions. As in method 3, each brain region can then be rated separately with respect to the reference population by converting its cortical thickness data into a z-score with the same formula.
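Under the same illustrative assumptions, the label-mean variant of method 4 could be sketched as follows (here μ_{Label} and σ_{Label} are taken across the control subjects’ label means, which is one plausible reading of the procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes standing in for 1000 subjects and 68 labels
n_subjects, n_vertices, n_labels = 200, 2000, 68
thickness = rng.normal(2.5, 0.3, size=(n_subjects, n_vertices))
labels = rng.integers(0, n_labels, size=n_vertices)  # vertex -> label map

# Mean thickness per label for every subject (d_Label; subjects x labels)
label_means = np.stack(
    [thickness[:, labels == lab].mean(axis=1) for lab in range(n_labels)],
    axis=1)

# Leave-one-out z-score of each label mean against the remaining subjects
null_z = np.empty((n_labels, n_subjects))
for i in range(n_subjects):
    controls = np.delete(label_means, i, axis=0)
    mu = controls.mean(axis=0)            # mu_Label
    sigma = controls.std(axis=0, ddof=1)  # sigma_Label
    null_z[:, i] = (label_means[i] - mu) / sigma
```

Averaging before standardizing removes the extreme-value (min) statistic entirely, which is what distinguishes method 4 from method 3.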

#### Validation

To validate and compare the proposed methods, specificity and sensitivity were calculated, separately for each vertex (method 1) or each label (methods 2–4). For that, the 113 data sets (“validation data sets”) of the 1113 HCP data sets were used which had been set aside during the generation of null distributions (see section “Subjects”). Statistical inference tests based on the null hypothesis of no atrophy for a given validation data set were carried out using the above-generated null distributions. For each vertex/label, the number of values of the null distribution that were lower than the observed (z-transformed) cortical thickness value in a given validation data set was counted. Dividing this count by the number of permutations (n = 1000) yielded FWER-corrected p-values (p_{FWER}) [32, 33]. Vertices/labels with p_{FWER} ≤ 0.05 were considered to indicate lower cortical thickness values than would be predicted by chance and were therefore labeled as “atrophic”. In method 2, since data points were summarized within each label, a label was defined as “atrophic” if a certain percentage of its vertices showed p_{FWER} ≤ 0.05. Different percentages were tested (1%, 5%, 10%, 20%, 30%, 40%, 50%). Given that all of these thresholds yielded similarly poor results, hereafter only the results for one threshold (5%, chosen arbitrarily) are provided. The data for the other thresholds are provided in Additional files 1 and 2.
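The permutation p-value can be illustrated as follows (the null distribution here is a hypothetical stand-in for the permutation results of one label):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical null distribution of 1000 z-min statistics for one label
null_dist = rng.normal(-3.5, 0.4, size=1000)

def p_fwer(observed_z, null_dist):
    """Fraction of null values lower than (i.e. at least as extreme as)
    the observed z-score; based on 1000 permutations, FWER-corrected."""
    return np.sum(null_dist <= observed_z) / len(null_dist)

strongly_thinned = p_fwer(-5.0, null_dist)  # far below the null -> small p
within_normal = p_fwer(-2.0, null_dist)     # above most null values -> large p
```

Because the null distribution was built from each permutation's most extreme (minimum) z-score, this comparison controls the family-wise error rate across vertices/labels.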

##### Specificity

Specificity defines the rate of true negatives, i.e. the share of subjects who are correctly classified as not having the condition of interest (here, “no atrophy”). The validation data sets were used to calculate specificity, assuming that these data sets, being a random selection from a sample of healthy young subjects with no history of psychiatric or neurologic disorders, can be labeled as non-atrophic. Each of the four methods was applied to all 113 validation data sets, and specificity was defined as the percentage of vertices (method 1) or labels (methods 2–4) which were not classified as significantly atrophic. This procedure was repeated for each validation data set separately, and the mean and standard deviation of specificity were determined across all 113 data sets (“cumulative specificity”).

In addition, to allow evaluation for distinct brain regions, specificity per atlas region was defined (for methods 2, 3 and 4 only, since method 1 does not analyze atlas regions). This was done by calculating, per atlas region, the percentage of the 113 validation data sets which were not classified as significantly atrophic in that region (“regional specificity”).

##### Sensitivity

Sensitivity defines the rate of true positives, i.e. the share of subjects who are correctly classified as having the condition of interest (here, “atrophy”). Given that no true atrophy was assumed in the validation data sets, different degrees of atrophy were simulated as follows (Fig. 2): the original cortical thickness data (each vertex) were multiplied by a number between 0 and 1 (e.g. multiplication by 0.9 represents simulated atrophy of 10%). For each of the 113 validation data sets, atrophy was simulated from 1% to 100% in steps of 1 percentage point (p.p.). Then, each of the four methods was applied to all of the simulated data sets. For each method and degree of atrophy, sensitivity was calculated as the percentage of vertices (method 1) or labels (methods 2–4) which were classified as significantly atrophic, summarized across all 113 data sets (“cumulative sensitivity”). Sensitivity across methods was compared using the degree of atrophy required to achieve a cumulative sensitivity of 80% (“cumulative sensitivity threshold”). Note that less sensitive methods require more pronounced atrophy to detect it, and therefore have a higher cumulative sensitivity threshold.
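A sketch of the atrophy simulation and the resulting cumulative sensitivity, using method 3 as an example (all sizes, population parameters, and null distributions below are hypothetical stand-ins, not the study's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical validation map, parcellation, and method-3 ingredients
n_vertices, n_labels = 2000, 68
thickness = rng.normal(2.5, 0.3, size=n_vertices)
labels = rng.integers(0, n_labels, size=n_vertices)
mu_label = np.full(n_labels, 2.5)     # assumed control means per label
sigma_label = np.full(n_labels, 0.3)  # assumed control SDs per label
null_min_z = rng.normal(-3.5, 0.4, size=(n_labels, 1000))  # stand-in nulls

def atrophic_labels(thickness, atrophy_pct):
    """Simulate uniform atrophy and flag labels with p_FWER <= 0.05."""
    simulated = thickness * (1 - atrophy_pct / 100)  # e.g. 10% -> * 0.9
    flagged = np.zeros(n_labels, dtype=bool)
    for lab in range(n_labels):
        z = (simulated[labels == lab] - mu_label[lab]) / sigma_label[lab]
        p = np.mean(null_min_z[lab] <= z.min())  # permutation p-value
        flagged[lab] = p <= 0.05
    return flagged

# Cumulative sensitivity: share of labels flagged per degree of atrophy
sensitivity = {pct: atrophic_labels(thickness, pct).mean()
               for pct in range(0, 101, 10)}
```

Plotting `sensitivity` against the simulated atrophy would yield the curve from which the 80% cumulative sensitivity threshold is read off.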

In addition, to allow evaluation for distinct brain regions, sensitivity per atlas region was defined for each degree of atrophy (for methods 2, 3 and 4 only, since method 1 does not analyze atlas regions). This was done by calculating, per atlas region, the percentage of the 113 validation data sets which were classified as significantly atrophic in that region (“regional sensitivity”).

Note that although atrophy was simulated at a uniform rate throughout the cortex (which is not how cortical thinning occurs in aging or pathology [10, 15, 16]), evaluation was performed for each vertex/label independently. The proposed methods are therefore also suited to analyzing spatially heterogeneous patterns of cortical thinning.