Characterizing T cell epitope loss potential through peptidome surveillance across evolving SARS-CoV-2 lineages

Supplementary information and interactive content accompanying our published analysis of HLA binding across the SARS-CoV-2 peptidome in viral variants, and of its impact on the potential loss of CD8+ and CD4+ T-cell epitopes.

Our pre-print can be found here:

Binder count fraction at pan-HLA hotspots in the most frequent S and N versions of VOC lineages

Interactive plots of binder count fraction relative to reference (NCBI Reference Sequence: NC_045512) at pan-HLA hotspots.

Figure 10

Distribution of binder count fraction at pan-HLA hotspots for VOC and VOI lineages

Figure 9

N candidate epitope conservation in SARS-CoV-2 lineages with representative S protein ranking in the top percent for HLA binder loss

Figure 6

| restriction | worst case (most loss) protein | most frequent protein |
|---|---|---|
| HLA-I | Fig. 6A | Fig. 6B |
| HLA-II | Fig. 6C | Fig. 6D |
| HLA-I & HLA-II | Fig. 6E | Fig. 6F |

Binder count fraction for all HLAs across all unique versions of S and N

Note that, for each protein considered, binder count fraction (relative to the reference SARS-CoV-2 genome) is computed after summation of binder count across all pan-HLA hotspots. See Methods in paper for details.
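The order of operations in the note above (sum first, then take the ratio) can be illustrated with a minimal sketch. The function name, the dictionary layout, and the counts are assumptions made for illustration only; the actual procedure is described in the Methods section of the paper.

```python
def binder_count_fraction(variant_counts, reference_counts):
    """Return the variant's total binder count divided by the reference total.

    Both arguments map a hotspot identifier to a binder count (a hypothetical
    layout). Counts are summed across hotspots before the ratio is taken.
    """
    reference_total = sum(reference_counts.values())
    if reference_total == 0:
        raise ValueError("reference has no binders; fraction is undefined")
    # Sum over the reference's hotspots so both totals cover the same regions.
    variant_total = sum(variant_counts[h] for h in reference_counts)
    return variant_total / reference_total


# Made-up counts for two hotspots of one protein (not real data).
reference = {"hotspot_1": 40, "hotspot_2": 10}
variant = {"hotspot_1": 30, "hotspot_2": 10}

fraction = binder_count_fraction(variant, reference)
print(fraction)
```

With these made-up counts the summed ratio is 40/50, whereas averaging the per-hotspot ratios (0.75 and 1.0) would give 0.875, so the two conventions are not interchangeable.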

Figure 5

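| date | panel A | panel B | panel C | panel D |
|---|---|---|---|---|
| 11/26/2021 | plot | plot | plot | plot |
| 09/25/2021 | plot | plot | plot | plot |
| 07/27/2021 | plot | plot | plot | plot |
| 05/11/2021 | plot | plot | plot | plot |
| 03/19/2021 | Fig. 5A | Fig. 5B | Fig. 5C | Fig. 5D |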

HLA cluster assignment and representative set selection

Figure 13

The .csv files below include our cluster label assignment for all processed HLA-I and HLA-II. Each file contains a column that explicitly indicates which HLAs were included in our analysis set (column selected in the HLA-I file, and column selected_ab in the HLA-II file).

For quick browsing, interactive plots are also included below.
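As a rough sketch of how the selection columns described above might be used, the snippet below keeps only the rows flagged as part of the analysis set. Only the column names selected and selected_ab come from this page; the other column names (hla, cluster), the example rows, and the encoding of the flag as True/False are assumptions and may not match the actual files.

```python
import csv
import io


def selected_hlas(csv_text, flag_column, name_column="hla"):
    """Return the HLA names whose selection flag is set.

    flag_column is "selected" (HLA-I) or "selected_ab" (HLA-II).
    name_column and the accepted flag spellings are assumptions.
    """
    truthy = {"1", "true", "yes"}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[name_column]
        for row in reader
        if row[flag_column].strip().lower() in truthy
    ]


# Made-up rows standing in for the HLA-I file (not real cluster labels).
example_hla1 = """hla,cluster,selected
HLA-A*02:01,0,True
HLA-A*02:06,0,False
HLA-B*07:02,1,True
"""

chosen = selected_hlas(example_hla1, flag_column="selected")
print(chosen)
```

For a file on disk, the same function can be fed the text read from that file; for the HLA-II file, pass flag_column="selected_ab".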

Evaluation of Predictions

Empirically observed T-cell response frequencies (RF) align with aggregated HLA-binder peaks

Figure 2

Each of the sub-figures can be accessed as a separate interactive file below. Click on the legend to toggle which data to display.

HLA binding prediction benchmark details

To supplement the published overall mean ROC AUC and PPV of our RNN and CNN across HLAs, we include per-HLA results below:

For each HLA, the table summarizes the sample counts (n_pos, n_neg), the number of samples classified as ambiguous by each system (RNN_n_ambig, CNN_n_ambig), and the corresponding fraction of the total sample count (RNN_n_ambig_fract, CNN_n_ambig_fract). On average across all HLAs, only 2.2% of samples were excluded as ambiguous by both the RNN and CNN systems; per-HLA ambiguous fractions ranged from 0.7% to 6%.
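The relationship between the count and fraction columns can be sketched as follows. This assumes the total sample count for an HLA is n_pos + n_neg, which is our reading of the column description rather than something stated explicitly; the row values and the allele name are made up.

```python
def ambiguous_fraction(n_pos, n_neg, n_ambig):
    """Fraction of an HLA's samples classified as ambiguous.

    Assumes the total sample count is n_pos + n_neg.
    """
    total = n_pos + n_neg
    if total == 0:
        raise ValueError("no samples for this HLA")
    return n_ambig / total


# One made-up row using the column names from the per-HLA table.
row = {"hla": "HLA-X", "n_pos": 150, "n_neg": 850,
       "RNN_n_ambig": 22, "CNN_n_ambig": 7}

row["RNN_n_ambig_fract"] = ambiguous_fraction(
    row["n_pos"], row["n_neg"], row["RNN_n_ambig"])
row["CNN_n_ambig_fract"] = ambiguous_fraction(
    row["n_pos"], row["n_neg"], row["CNN_n_ambig"])

print(row["RNN_n_ambig_fract"], row["CNN_n_ambig_fract"])
```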

As referenced in our paper, performance metrics of other systems reported for individual HLAs were obtained from Reynisson et al. 2020, Supplementary Table 8.