In the present study, through a geographical epidemiological analysis, we observed significant regional differences among the northern, central and southern regions in the frequency of the two most common HLA haplotypes in the Italian population: HLA-A*01:01g-B*08:01g-C*07:01g-DRB1*03:01g (ranked #1 at the national level) shows a decreasing frequency gradient from North to South, whereas HLA-A*02:01g-B*18:01g-C*07:01g-DRB1*11:04g (ranked #2) shows an increasing one. The geographical distribution of these haplotypes overlaps with that of Covid-19 in Italy, with a positive (direct) linear correlation for haplotype #1 and a negative (inverse) linear correlation for haplotype #2. In other words, high incidence and mortality were observed in the northern regions, where the population has high frequencies of the haplotype HLA-A*01:01g-B*08:01g-C*07:01g-DRB1*03:01g and of all the allelic combinations of the four considered loci (HLA-A, -B, -C, -DRB1) containing at least one of its alleles, particularly those carrying the B*08:01g and DRB1*03:01g polymorphisms, suggestive of potential ‘susceptibility’ to the disease. Conversely, low Covid-19 incidence and mortality were observed in the central-southern regions, which have high frequencies of the haplotype HLA-A*02:01g-B*18:01g-C*07:01g-DRB1*11:04g and of its alleles B*18:01g, C*07:01g and DRB1*11:04g in all possible combinations containing at least one of them, suggestive of potential ‘protection’ from the infection. Hence, the population of central-southern Italy, which shows the highest prevalence of the protective haplotype HLA-A*02:01g-B*18:01g-C*07:01g-DRB1*11:04g and its allelic combinations and, at the same time, the lowest frequencies of the disadvantageous haplotype HLA-A*01:01g-B*08:01g-C*07:01g-DRB1*03:01g and its allelic combinations, could be genetically shielded from Covid-19.
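As an illustration of the kind of regional linear correlation described above, the following is a minimal sketch that assumes Pearson's r as the correlation measure (the text does not name the statistic used). The function name `pearson_r` and the toy vectors are hypothetical and are not data from this study; in practice one vector would hold the regional haplotype frequencies and the other the regional incidence or mortality values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values ONLY (assumption for illustration, not study data):
# one entry per region, in the same region order in both lists.
toy_haplotype_freq = [1.0, 2.0, 3.0, 4.0]
toy_incidence = [10.0, 20.0, 30.0, 40.0]
r_toy = pearson_r(toy_haplotype_freq, toy_incidence)
```

A value of r close to +1 would correspond to the direct relationship described for haplotype #1, and a value close to -1 to the inverse relationship described for haplotype #2.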
Such findings are only descriptive in nature and need to be validated through retrospective observational case–control studies on HLA-typed Covid-19 patients, comparing the frequencies of the potentially ‘protective’ and ‘unfavourable’ HLA haplotypes and alleles highlighted in the general Italian population with those observed in the Covid-19 patient cohort. Only in this way can such HLA polymorphisms be defined as factors genuinely associated with disease susceptibility, as has already been done for other viral infections, communicable diseases and autoimmune pathologies [15,16,17,18,19]. Nevertheless, in these pathologies too, geographical epidemiological approaches of this kind have provided important clues for identifying the sub-populations most at risk of infection, also taking specific HLA alleles and haplotypes into account as a susceptibility parameter [13].
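One possible way to frame the case–control comparison proposed above is an odds ratio on a 2×2 table of haplotype carriers versus non-carriers among patients and controls. The sketch below assumes a Woolf (log-based) 95% confidence interval; the function name `odds_ratio_ci`, its parameters and the example counts are hypothetical and do not come from this study or from any patient cohort.

```python
from math import exp, log, sqrt


def odds_ratio_ci(carriers_cases, noncarriers_cases,
                  carriers_controls, noncarriers_controls, z=1.96):
    """Odds ratio of carrying a haplotype in cases vs controls,
    with a Woolf (log-scale) confidence interval.

    z=1.96 corresponds to an approximate 95% interval.
    """
    a, b = carriers_cases, noncarriers_cases
    c, d = carriers_controls, noncarriers_controls
    if min(a, b, c, d) <= 0:
        # The simple formula below breaks down with empty cells.
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Placeholder counts ONLY (assumption for illustration, not study data).
or_toy, lower_toy, upper_toy = odds_ratio_ci(20, 80, 10, 90)
```

An interval lying entirely above 1 would be compatible with a ‘susceptibility’ effect and one lying entirely below 1 with a ‘protective’ effect; an interval that includes 1 would not support either interpretation.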
To the best of our knowledge, this is the first study to estimate, through a population frequency analysis, the potential association of specific HLA alleles and haplotypes with the incidence and mortality of Covid-19. Although the primary purpose of a bone marrow registry is to increase the chances of finding compatible allogeneic donors for transplantation, it is also a unique source of valuable HLA data from the largest and most representative sample available at the national level, which makes it possible to reliably estimate haplotype frequencies in a given population and to carry out association studies in many disease contexts. We conducted our study on a large sample of 104,135 subjects typed at high resolution (four-digit level) and subdivided among the 20 Italian regions, with a regional sample size statistically representative of the resident population of each region [20].
Our study is the first to propose HLA as a susceptibility marker for SARS-CoV-2 infection and to highlight its potential impact on the epidemic trend within a specific country, Italy, that has been hit particularly hard. Similar associations may also be observed in other countries, bringing to light common genetic patterns or new country-specific protective or unfavourable HLA polymorphisms that could explain some of the differences in the epidemic observed between countries. Such geographical epidemiological studies, conducted at the general population level, need to be confirmed in cohorts of asymptomatic, mildly symptomatic and severely affected Covid-19 patients in order to draw conclusions with important implications not only at the epidemiological level but also at the clinical one. Indeed, particular HLA haplotypes/alleles could be associated with a stronger immune response and hence a better host response to the virus. Some useful information can also be inferred from previous research on SARS and MERS, where several HLA polymorphisms have been reported to be associated with SARS susceptibility (HLA-B*46:01, HLA-B*07:03, HLA-DRB1*12:02 and HLA-Cw*08:01) [25,26,27]. Conversely, the alleles HLA-DR*03:01, HLA-Cw*15:02 and HLA-A*02:01 seem to be protective against SARS infection [28]. HLA-DRB1*11:01 and HLA-DQB1*02:02 are related to susceptibility to MERS-CoV infection [29]. On these premises, it is conceivable that several HLA associations could also be unfavourable or protective for the course of Covid-19 infection.
Very recent works have employed different bioinformatic approaches to predict the best SARS-CoV-2-derived B and T cell epitopes and their associated HLA alleles, which may help to design effective vaccines and find protective antibodies [30,31,32,33,34,35]. Using HLA binding affinity prediction tools, it has been observed that HLA-A and HLA-C alleles exhibit, respectively, the greatest and the least capacity to present SARS-CoV-2 antigens. However, depending on the specific study and the bioinformatic approach used, the best and worst predicted presenters of conserved peptides are not the same. We found that the alleles analysed in our study are present in the database recently made available by Nguyen et al., which reports 32,257 8- to 12-mer peptides from the SARS-CoV-2 proteome and their binding affinity to 145 different HLA-A, -B and -C alleles, as predicted by bioinformatic tools [30]. In particular, all the alleles pointed out in our study are predicted to have an overall good capacity to present viral peptides, independently of their potential correlation with Covid-19 regional incidence and mortality, with HLA-A*02:01 being the best (1062 total peptides, 268 with a very high binding affinity < 50 nM), followed by HLA-B*08:01 (225 total, 25 high affinity), HLA-A*01:01 (183 total, 44 high affinity), HLA-B*18:01 (101 total, 12 high affinity) and HLA-C*07:01 (44 total, 4 high affinity) (Additional file 1: Fig. S1) [30].
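The per-allele tallies quoted above (total peptides and those with a predicted affinity < 50 nM) can be obtained by a simple count over a table of predictions. The following is a minimal sketch under stated assumptions: the record layout `(allele, peptide, affinity_nM)`, the function name `summarize_binders` and the toy records are hypothetical and do not reproduce the format or content of the database in [30]; only the 50 nM threshold is taken from the text.

```python
from collections import defaultdict

HIGH_AFFINITY_NM = 50.0  # "very high binding affinity" threshold quoted in the text


def summarize_binders(records, threshold_nm=HIGH_AFFINITY_NM):
    """Count, per allele, all listed peptides and those whose predicted
    affinity is strictly below threshold_nm.

    records: iterable of (allele, peptide, affinity_nM) tuples
             (hypothetical layout, assumed for illustration).
    """
    summary = defaultdict(lambda: {"total": 0, "high": 0})
    for allele, _peptide, affinity_nm in records:
        summary[allele]["total"] += 1
        if affinity_nm < threshold_nm:
            summary[allele]["high"] += 1
    return dict(summary)


# Placeholder records ONLY (not real predictions or real peptides).
toy_records = [
    ("ALLELE_X", "PEPTIDE_1", 10.0),
    ("ALLELE_X", "PEPTIDE_2", 400.0),
    ("ALLELE_Y", "PEPTIDE_3", 50.0),  # exactly 50 nM is not counted as < 50 nM
]
toy_summary = summarize_binders(toy_records)
```

Keeping the threshold as a parameter makes it straightforward to repeat the count with a different cut-off, since different prediction studies do not necessarily use the same one.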
It is important to note that all the bioinformatic predictions made on SARS-CoV-2 epitopes and their HLA binding are limited by being purely theoretical and thus need to be experimentally validated, both in in vitro binding assays and in their ability to effectively elicit T and B cell-mediated responses. Indeed, it is widely recognised that antigenicity, immunogenicity and, for T cells, the TCR avidity for the antigen/HLA complex, and hence the functional immune responses elicited, are not directly related to peptide binding affinity [36, 37]. No information is available to date regarding binding to HLA class II molecules, whose polymorphic variants could play a relevant role in orchestrating a functional adaptive immune response.
Undoubtedly, the method of analysis used in our study has some limitations and could be affected by an inevitable selection bias, since it considers the region of birth of the typed individuals but not their region of residence, whereas Covid-19 infections are reported by the region where the infection occurred, independently of birthplace. However, we can reasonably exclude an influence of migration flows (which in Italy have historically been directed from the southern to the northern regions) on the regional frequencies used in our computations, since these frequencies are equivalent to those from previous studies that included information on both region of birth and region of residence and are therefore, thanks to the large size of the regional subgroups analysed, independent of migratory movements [38, 39]. The information about Covid-19 cases and deaths relies on public resources, updated daily on the basis of laboratory analysis of swabs that tested positive for the virus by RT-PCR at the regional accredited centres, following confirmatory testing by the Italian National Institute of Health in Rome. As reported above, these values could have been underestimated because of several factors, such as a stringent testing policy limited to severely affected symptomatic individuals, which excluded the bulk of asymptomatic ones from testing, a shortage of testing materials at the peak of the emergency and limited access to overcrowded hospital facilities, to name just a few. Notably, a higher overall mortality rate than in previous years has been observed in Nembro, a small town in the Lombardia region, indicative of both the direct and the indirect disease burden; this has also been highlighted by a recent report published by the Italian National Institute for Social Security [5, 6].
Apart from its epidemiological value in tracing the distribution of Covid-19 and understanding its immunopathogenesis, the identification of specific HLA haplotypes as potential risk, susceptibility or protective biomarkers can be of great help in stratifying the population to identify those patients at higher risk of developing a severe infection, thus allowing the adoption of appropriate preventive strategies and early intervention measures.
It is important to note that the HLA region is known for its linkage disequilibrium; therefore, other genes located very near to HLA could ultimately be responsible for the association with the regional distribution of Covid-19. Genetic polymorphisms in the HLA locus or in other genes encoding key components of the immune-inflammatory response observed in SARS-CoV-2 infection (KIR receptors, inflammasome components, cytokines and chemokines such as CXCL10) may help to explain the highly variable spectrum of disease manifestations, progression and outcome (from asymptomatic, to mild-moderately symptomatic, to severely affected patients requiring intensive care and respiratory support).
With this in mind, even though the available knowledge is still limited to a few studies, some susceptibility markers other than HLA have been proposed for Covid-19. An association with ABO blood antigens has been observed in a cohort of Chinese patients, with types A and O being at the highest and lowest risk of infection, respectively, as previously reported for other viral infections [40]. This observation was confirmed in a genome-wide study on Spanish and Italian patient cohorts. Indeed, a skewed distribution of ABO blood antigens was reported among Covid-19 patients who suffered from respiratory failure, whereas no significant association was found between HLA polymorphisms and respiratory failure (oxygen supplementation or mechanical ventilation) in Covid-19 patients [41]. To the best of our knowledge, this is the only study available to date that considers the association between HLA polymorphisms and Covid-19 severity, but it is important to note that it was performed in a limited Italian population, including only patients from the Lombardia region, without taking into account geographical patterns of HLA distribution. Genetic polymorphisms in key genes of the virus entry machinery (ACE2, TMPRSS2, CTSB and CTSL) or of the inflammatory/immune response (e.g. cytokines and their receptors), as well as epigenetic mechanisms, may also influence susceptibility to the virus and the severity/outcome of the infection among different individuals and populations [42, 43]. Indeed, a novel susceptibility locus on chromosome 3p21.31 containing a cluster of six genes (SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6 and XCR1), most of which are involved in the regulation of the inflammatory and immune response, has been identified [41].
We recognize that other factors, e.g. climatic differences, pollution and the lockdown effect that limited the spread from North to South, could be responsible, alone or in combination with genetic factors, for the different Covid-19 infection rates among Italian regions. Our reported potential association of two haplotypes with the differential regional incidence and mortality of Covid-19 in Italy may explain, from the point of view of the genetic diversity of the Italian population, why the epidemic hit the northern regions so hard and had instead a small impact on those of the centre-south, a pattern that cannot be explained on the basis of population, urban density, movements to and from large urban and industrial areas, pollution or climate alone. Indeed, several central and southern metropolitan areas such as Rome, Naples, Bari and Palermo (located in Lazio, Campania, Puglia and Sicilia, respectively) have an urban density comparable to or even higher (Naples) than that of Milan and Lombardia, atmospheric PM10, PM2.5 and NO2 levels above threshold, and high mobility flows on public transport [44]. Furthermore, climatic variations in Italy are very limited and not comparable to those occurring in larger countries such as China, the US or Brazil [45].
Our correlation analysis between HLA regional frequencies and Covid-19 case/death numbers, having been carried out at different times over the course of the epidemic, also takes into account the potential effects of the displacement of thousands of off-site students and workers from the northern regions (mainly Lombardia, the epicentre of the epidemic) to the southern regions (Campania, Puglia, Calabria, Sicilia), which occurred in two large waves: the night before the start of the lockdown (the 9th of March, totally uncontrolled) and at the end of the lockdown (after the 3rd of May, with some monitoring from region to region). These exoduses, and especially the first one, although occurring in a phase of mobility restrictions and contact reduction, could have caused the epidemic to break out in the southern Italian regions. This did not occur, which makes the hypothesis of protective genetics in the populations of central-southern Italy even more plausible.
Genetic variations and HLA polymorphisms alone cannot explain other significant features of Covid-19, such as the higher mortality observed in men vs women (2.8% vs 1.7%) or the higher morbidity and mortality in older vs younger people [46,47,48]. However, it is essential to take into account that significant immunological differences exist among these groups and that such differences could depend on HLA polymorphisms and, more generally, on the genetic, hormonal and metabolic background. Indeed, HLA genes are involved in the age-related decline of the T cell-mediated anti-viral response.