Personalized antigen discovery, enabling the identification of peptides presented on a person’s cancer cells by class I and II human leukocyte antigen (HLA-I and HLA-II) and recognized by autologous T cells, is crucial for the development of cancer vaccines, adoptive transfer of T cells and various T cell-targeting molecules. Methods based on whole-genome sequencing (WGS) or whole-exome sequencing (WES) and RNA sequencing (RNAseq) for mutated antigen (neoantigen) predictions are commonly used for translational research and in clinical trials. Recently, the application of mass spectrometry (MS) to identify HLA-bound peptides, in combination with proteogenomics, facilitated the exploration of novel targets from a variety of antigens naturally processed and presented in cancer, including neoantigens, tumor-associated antigens and tumor-specific antigens (TAAs and TSAs), oncoviruses and peptides translated from putative non-protein-coding transcripts. However, identifying these antigens is laborious, and current clinical pipelines do not support immunopeptidomics and remain restricted to predicted neoantigens.
Immunotherapies are remarkably effective against some tumor indications; however, robust immune pressure may lead to immune editing, resulting in the selection of tumor cells displaying diminished antigenicity. Mutations, such as somatic single-nucleotide variants (SNVs), insertions, deletions and copy number (CN) variations (CNVs), can disable parts of the antigen processing and presentation machinery (APPM). These alterations affect components such as β2-microglobulin (B2M), transporters associated with antigen processing (TAP1 and TAP2) and the HLA locus, hindering immune recognition. It is, thus, essential to gain a comprehensive understanding of the heterogeneous antigenic landscape and of the tumor’s capacity to present antigens.
In this study, we introduce an ‘end-to-end’ clinical antigen discovery proteogenomic pipeline called NeoDisc. NeoDisc is a compilation of publicly available and in-house software for the identification of immunogenic tumor-specific HLA-I and HLA-II antigens from genomics, transcriptomics and MS-based immunopeptidomics, as well as their prediction and prioritization with rule-based and machine-learning (ML) tools. It enables the assessment of both tumor heterogeneity and the functionality of the APPM. We compared NeoDisc’s performance with other available tools and demonstrated its application for personalized antigen discovery for translational cancer research and its clinical implementation for the design of cancer vaccines. Through deep investigation of tumors from seven persons with cancer of various indications, we exemplify how NeoDisc’s comprehensive multiomics data integration allows detection and prioritization of immunogenic TSAs from various sources. We also demonstrate how NeoDisc uncovers the heterogeneous antigenic landscape linked to defects in the APPM, knowledge that is crucial for translational research and the advancement of personalized cancer immunotherapy.
NeoDisc is a dedicated computational framework that combines genomic, transcriptomic and immunopeptidomic data for the prediction and direct identification of clinically relevant antigenic peptides presented exclusively on cancer cells derived from multiple sources, including mutations, TSAs, TAAs, oncoviral elements and noncanonical transcripts. The framework integrates curated public databases of known immunogenic TSAs and viral antigens with ML and rule-based models for the prioritization and selection of clinically relevant antigens.
NeoDisc uses matched tumor and germline genomic (WES or WGS) data for sample-specific variant characterization, tumor content estimation and the identification of CNVs and somatic mutations (SMs). Although NeoDisc can process WGS data, none were used in this study. Bulk RNAseq data are used for human gene and SM expression quantification and estimation of T cell inflammation. Unmapped RNAseq reads are further used for viral infection identification and viral gene expression quantification, primarily targeting oncoviruses. NeoDisc creates personalized proteome references, where single-nucleotide polymorphisms (SNPs), SMs and noncanonical expressed transcripts are annotated and used for downstream HLA binding and immunogenicity prediction of antigenic peptides. Immunopeptidomic data are searched against the personalized proteome to identify naturally presented antigenic peptides. HLA-I and HLA-II typing is derived from germline and tumor WES and from RNAseq data. Defects in the tumor APPM and HLA loss of heterozygosity (LOH) are identified and highlighted. An ML algorithm, specifically trained on a complex matrix of tens of features, prioritizes likely immunogenic HLA-I neoantigens (minimal epitopes) and longer mutated sequences and supports the design of RNA-based and peptide-based vaccines. Beyond HLA-I-restricted neoantigens, other classes of antigens exclusively expressed in the tumor, such as HLA-II neoantigens, TAAs and oncoviruses, are ranked by rule-based approaches, adjusted on the basis of a decision scheme considering a limited set of features. NeoDisc supports data integration of multiple samples from the same person, providing insights into tumor heterogeneity and evolution.
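To make the personalized-proteome step more concrete, the following minimal Python sketch shows one way a single somatic amino acid substitution could be applied to a reference protein sequence and the overlapping candidate minimal epitopes enumerated for downstream HLA binding prediction. It is an illustration only, not NeoDisc's implementation: the function names, the `MutatedPeptide` record, the example sequence and the default peptide lengths (8 to 12 residues) are all assumptions introduced here.

```python
from typing import List, NamedTuple, Sequence


class MutatedPeptide(NamedTuple):
    """Hypothetical record for one candidate peptide (not a NeoDisc type)."""
    peptide: str          # mutated peptide sequence
    wild_type: str        # matching wild-type peptide
    mutation_offset: int  # 0-based index of the substituted residue in the peptide


def apply_substitution(protein: str, position: int, ref: str, alt: str) -> str:
    """Return `protein` with the residue at 1-based `position` changed from `ref` to `alt`."""
    if not 1 <= position <= len(protein):
        raise ValueError("position is outside the protein sequence")
    if protein[position - 1] != ref:
        raise ValueError("reference residue does not match the protein sequence")
    return protein[:position - 1] + alt + protein[position:]


def mutated_peptides(
    protein: str,
    position: int,
    ref: str,
    alt: str,
    lengths: Sequence[int] = (8, 9, 10, 11, 12),  # assumed HLA-I length range
) -> List[MutatedPeptide]:
    """Enumerate every peptide of the given lengths that spans the substituted residue."""
    mutated = apply_substitution(protein, position, ref, alt)
    idx = position - 1  # 0-based index of the substituted residue
    candidates = []
    for k in lengths:
        first = max(0, idx - k + 1)        # leftmost window that still covers idx
        last = min(idx, len(protein) - k)  # rightmost window that fits in the protein
        for start in range(first, last + 1):
            candidates.append(
                MutatedPeptide(
                    peptide=mutated[start:start + k],
                    wild_type=protein[start:start + k],
                    mutation_offset=idx - start,
                )
            )
    return candidates


if __name__ == "__main__":
    # Made-up sequence and substitution (R10W), used purely for illustration.
    for candidate in mutated_peptides("MKTAYIAKQRQISFVKSHFSRQ", 10, "R", "W", lengths=(9,)):
        print(candidate.peptide, candidate.wild_type, candidate.mutation_offset)
```

In a real pipeline, each candidate would additionally carry variant-level annotations (for example, expression and HLA allele information) before being scored; the sketch covers only the sequence-windowing step and handles substitutions, not insertions, deletions or frameshifts.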
NeoDisc runs on a Linux system and is packaged in a Singularity container. The runtime was 15.78 ± 9.52 h (100 GB of random-access memory and 24 central processing units) for a participant’s dataset that included matched germline–tumor WES, tumor RNAseq and data-dependent and data-independent acquisition (DDA and DIA, respectively) of immunopeptidome MS data. The runtime depends on the sample’s sequencing depth, immunopeptidome depth and mutational load. Lastly, NeoDisc generates processed genomic, transcriptomic and immunopeptidomic data, detailed characterization of the APPM, prioritized lists of TSAs and sample-specific reports, ensuring data traceability.
NeoDisc applies four variant-calling algorithms to WES and WGS data. Variants detected by only one caller are deemed to have low identification confidence, whereas those identified by two or more callers are considered to have high identification confidence. In default settings, both low-confidence and high-confidence variants are included in the personalized proteome for immunopeptidomic searches, while only high-confidence variants are considered for in silico neoantigen predictions, ensuring a robust selection of predicted neoantigens. Highly mutated tumors tend to respond better to immunotherapy, likely because they present more neoantigens. Yet, the selection of immunogenic neoantigens among numerous possibilities is challenging. Initially, rule-based approaches were applied for the prioritization of HLA-I-restricted and HLA-II-restricted neoantigens and other antigenic peptides exclusively expressed in the tumor. Recently, large datasets of neoantigens in tumors of 112 participants (called the National Cancer Institute (NCI) cohort) were systematically screened by Gartner et al. for autologous T cell responses, allowing the training of ML tools for prioritization. Minigenes expressing almost all the called mutations were transcribed in vitro and transfected into autologous antigen-presenting cells (APCs), followed by coculture with tumor-infiltrating lymphocyte (TIL) cultures and interferon-γ (IFNγ) enzyme-linked immunospot (ELISpot) immunogenicity measurement. In most cases, additional immunogenicity screens were performed to identify the optimal epitopes and their HLA restrictions. Previously, we trained ML classifiers on a fraction of this dataset (NCI-train) and tested their performance on the remaining samples (NCI-test), as well as additional cohorts. These ML classifiers were