Analysis results on 93,930 SARS-CoV-2 sequences
We extracted 93,930 high-quality sequences (available on October 2, 2020) from the EpiCov™ section of the GISAID portal [20], which acts as a worldwide repository of viral isolates. Every viral isolate contains zero or more variants with respect to the Wuhan strain. The fully annotated table of 20,640 variants is available in Additional file 1: Table S1.
Considering the total number of sequences, the most frequent Spike protein variation is confirmed to be D614G (Fig. 1b). Next, there are a few variants with a frequency of 2–7% that are slowly rising in the viral populations, namely S477N, Q613H and A222V. The S477N mutation lies in the RBD locus, while all the other RBD variants are below the 1% penetrance threshold (Fig. 1c, Additional file 1: Table S2).
We asked whether the RBD variants are strongly associated with a certain geographical region. For this purpose, we propose to measure “mutations per thousand isolates” (MpTI), that is
$$\mathrm{MpTI} = \frac{M \times 10^{3}}{I_{c}}$$
with \(M\) as the absolute mutation count and \({I}_{c}\) the total isolates for that country.
This normalization smooths out the inter-country variability, but a resolution bias remains because the number of sequenced isolates varies by orders of magnitude among countries, from thousands to dozens. This bias will cause rare mutations to be hidden in countries with few associated isolates, while the associations with highly frequent outliers will remain more robust.
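As a minimal sketch of the normalization above, the snippet below computes MpTI for one country and the smallest nonzero MpTI that a given sample size can resolve (a single observed mutation), which illustrates the resolution bias. The function names, country labels and counts are hypothetical and are not taken from the dataset.

```python
def mpti(mutation_count: int, total_isolates: int) -> float:
    """Mutations per thousand isolates: M * 1e3 / I_c."""
    if total_isolates <= 0:
        raise ValueError("total_isolates must be positive")
    return mutation_count * 1e3 / total_isolates


def min_nonzero_mpti(total_isolates: int) -> float:
    """MpTI produced by a single observed mutation (M = 1)."""
    return mpti(1, total_isolates)


# Hypothetical (mutation count, total isolates) pairs, for illustration only.
examples = {"country_A": (120, 8000), "country_B": (1, 40)}

for name, (m, i_c) in examples.items():
    print(name, mpti(m, i_c), min_nonzero_mpti(i_c))
```

With few isolates, a single observation already yields a large MpTI, and anything rarer than one mutation per sample is invisible, so values from sparsely sequenced countries should be read with caution.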
The geographical distribution shows that the S477N variant is strongly rooted in Australia, while N439K is associated with clusters starting from the United Kingdom (Scotland). However, none of the most frequent variants is uniquely associated with one country, as expected from the worldwide virus distribution, and clusters of co-occurring variants lie mostly in countries with the highest number of available sequences (i.e., USA, England) (Additional file 1: Table S2, Fig. 2a, b). To better understand the evolution of variants over time and space, we tracked the location of all the source isolates carrying these two variations (Fig. 2c). S477N was first identified in Colombia and is harbored in more than 60% of the isolates sequenced in Australia from June 2020. On the other hand, N439K dominates the isolate landscape in Ireland and England from August 2020.
Immunogenomic analysis
These novel RBD variants may have several biological and putatively clinical impacts on the virus functions. For instance, every protein-coding variant changes several epitope sequences presented on the human cell’s surface by the Major Histocompatibility Complex (MHC). This, in turn, can have an impact on recognition by the T and B cells of the human immune system. To shed light on these processes, we computationally modeled the immunological impact of SARS-CoV-2 epitopes in terms of (1) MHC class 1 presentation of antigens, (2) T-cell immunogenicity and (3) B-cell epitope prediction.
Considering only the RBD mutations above 0.1% frequency, we performed MHC class 1 binding prediction for both the wild-type (Wuhan strain) and the mutated strain, choosing the most frequent HLAs in public databases [21]. The software generated predictions for all possible 9-mers resulting from the full RBD sequence, and we counted how many binders were predicted for all the considered HLAs. Out of 18 predictions, only one epitope had binding affinity in mutated or wild-type, with no change in binding affinity caused by the mutation (Additional file 1: Table S3).
| Mutation | WT | Mutant | Wild-Type binders | Mutated binders |
|---|---|---|---|---|
| S477N | IYQAGSTPC | IYQAGNTPC | 1 | 1 |
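The peptide enumeration step described above can be sketched as follows: given a protein sequence and a single substitution, list every wild-type/mutant 9-mer pair that overlaps the mutated position. This is only an illustration of the windowing; the helper names are hypothetical, the 0-based index is an assumption of this sketch, and the binding prediction itself (performed by the software used in the study) is not reproduced here.

```python
def substitute(seq: str, index: int, ref: str, alt: str) -> str:
    """Return seq with the residue at 0-based `index` changed from ref to alt."""
    if seq[index] != ref:
        raise ValueError(f"expected {ref} at index {index}, found {seq[index]}")
    return seq[:index] + alt + seq[index + 1:]


def overlapping_kmer_pairs(seq: str, index: int, ref: str, alt: str, k: int = 9):
    """List (wild-type, mutant) k-mer pairs whose window covers `index`."""
    mutant = substitute(seq, index, ref, alt)
    start = max(0, index - k + 1)        # first window that still covers index
    stop = min(index, len(seq) - k)      # last window that fits in the sequence
    return [(seq[i:i + k], mutant[i:i + k]) for i in range(start, stop + 1)]


# The wild-type 9-mer from the table above; S477N falls on its sixth residue.
pairs = overlapping_kmer_pairs("IYQAGSTPC", 5, "S", "N")
print(pairs)
```

On a longer sequence, a substitution far enough from both ends is covered by nine distinct 9-mers, each of which would be scored against every considered HLA.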
Then, we asked whether these mutations had a significant impact on T- and B-cell recognition, and we ranked them via the Immune Epitope Database and analysis resource (IEDB) [18]. When considering T-cells with the class 1 immunogenicity tool [22], 54/189 (29%) epitopes showed a negative, non-immunogenic score in both wild-type and mutated forms. When focusing on the predicted class 1 binders for the same epitope/HLA combination in mutated peptides, IYQAGNTPC (S477N) shows increased immunogenicity for all considered HLAs (Fig. 3a). These results point to a putative variability in T-cell response mediated by these mutations.
Testing epitopes for putative B-cell recognition remains an analytical challenge; only a few algorithms have been developed for this purpose [23]. When focusing on all the mutated epitopes caused by S477N and N439K, neither mutation causes a shift in amino acid exposure, as both are predicted to be in the exposed status. The overall Epitope score, which takes into account a variable amino acid window surrounding the mutation, is slightly increased, from 0.535/0.516 in the WT sequence to 0.561 and 0.548 for S477N and N439K, respectively (Additional file 1: Table S5).
Covid-miner data portal
To share the results with a wider public, we created a frontend portal for the analysis results, freely accessible at https://covid-miner.ifo.gov.it. The web app features two sections: a Variant section dedicated to browsing and visualizing the most frequent viral mutations over the genome, and a Geographical distribution heatmap that displays the association between countries and the most frequent RBD mutations. The home page displays the number of wild-type and mutated RBD variants as a main summarizing figure (Fig. 3b, c).