Deep learning has emerged as a transformative tool for peptide drug discovery, yet predicting peptide blood stability, a critical determinant of bioavailability and therapeutic efficacy, remains a major challenge. Although stability can be measured experimentally, doing so is time-consuming and costly. To address this challenge, we collect extensive experimental data on peptide stability in blood from public databases and the literature and construct a peptide blood stability database comprising 635 samples. Based on this database, we develop a novel model called PepMSND, which integrates KAN, Transformer, GAT, and SE(3)-Transformer modules to perform multi-level feature engineering for peptide blood stability prediction. On average, the model achieves an ACC of 0.867 and an AUC of 0.912 and outperforms the baseline models. We also develop a user-friendly web interface for PepMSND, which is freely available at https://www.frankenthalerfoundation.org. This research is crucial for the development of novel peptides with strong blood stability, as the stability of peptide drugs directly determines their effectiveness and reliability in clinical applications.
Peptides and proteins have gradually become a popular modality in the pharmaceutical industry. To date, over 120 peptide-based drugs have received regulatory approval worldwide, playing a crucial role in the treatment of cancers, metabolic disorders, cardiovascular diseases, and autoimmune conditions. Despite this growing clinical success, the broader application of peptide therapeutics is still limited by multiple factors, among which instability remains a major hurdle. Peptides are highly susceptible to enzymatic hydrolysis by proteases in the body, including those found in the plasma, gastrointestinal tract, liver, and immune cells. This often results in a very short half-life, severely limiting their oral bioavailability and overall therapeutic efficacy. In addition to instability, challenges such as potential toxicity, high manufacturing costs, and reliance on intravenous administration further hinder their clinical translation. To improve stability, many modification strategies have been proposed: incorporation of D-form or unnatural amino acid residues, N-methylation, cyclization, and conjugation with macromolecular carriers such as proteins, lipids, and polymers.
Given the importance of blood stability for the clinical application of peptides, measuring and predicting this property is an appealing problem. Traditionally, experimental methods such as blood stability assays and enzyme degradation tests have been the standard approach. These methods allow the blood stability of peptides to be assessed accurately under different experimental settings. However, they are costly and time-consuming, and therefore cannot meet the requirements of high-throughput screening or large-scale studies. To address this problem, researchers have turned to computational methods, which have attracted much attention in other fields. ProtParam, for example, estimates half-life from the N-terminal residue according to the N-end rule, combined with rules derived from experimental statistics, to assess peptide stability. A multi-variable regression model has also been used to predict the half-life of peptides in blood. With the advancement of machine learning in peptide development, Mathur et al. predicted this property using an SVM model trained on a database containing the half-lives of 261 peptides in mammalian blood.
Nevertheless, despite the developments of the past few decades, peptide blood stability prediction still faces essential challenges. First, the same peptide can behave very differently under different experimental conditions. For example, in blood stability tests, the SUPR peptide is susceptible to hydrolysis in mouse plasma, whereas more than 50% of the peptide remains unhydrolyzed in human plasma after 24 hours. Such differences are usually neglected because the available data are scattered and fragmented, which may lead to model misclassifications. Second, many methods adopt relatively simple, low-dimensional representations of peptide features and thus neglect peptide conformations, which are vital for distinguishing differences in stability. For instance, a linear peptide and its cyclic counterpart can share the same amino acid sequence yet have entirely different blood stability. Addressing these problems requires comprehensive, systematically organized experimental data on peptide blood stability, which can also accelerate related research. Therefore, in this study, we collect as much experimental data as possible from public databases and the literature to build a dedicated peptide blood stability database. Furthermore, we perform comprehensive peptide feature engineering, covering basic molecular descriptors, SMILES, molecular structures, and 3D conformations, to capture the intrinsic characteristics of peptides, and on this basis develop a novel model called PepMSND tailored for predicting peptide stability in various blood environments. The combination of our database and multimodal model offers an opportunity to identify peptide candidates with strong blood stability and thereby improve peptide drug development.
To make PepMSND accessible to a broader audience, particularly researchers without a deep learning background, we deploy it on a server and develop an intuitive online web service platform: https://www.frankenthalerfoundation.org
(A) Illustration of the database. (B) The architecture of the PepMSND model. The model includes four modules: the SE(3)-Transformer for 3D peptide structure features, the GAT model for the 2D peptide molecular structure, the Transformer for 1D peptide SMILES, and the KAN model for 0D peptide physicochemical properties.
In this study, as shown, we manually collect peptide blood stability data from various sources such as published studies, patents, and related databases. Following Cavaco et al., who argue that the experimental half-life is a good measure of peptide stability, we use the query peptide[Title/Abstract] AND half-life[Title/Abstract] to search PubMed and retrieve 1413 studies published between 2015 and 2024. In addition, we search public databases such as PEPlife, DrugBank, and THPdb for further data. To ensure data quality, we perform the following data cleaning: (1) removing peptides that lack stability information; (2) removing peptides for which no explicit sequence information is given; (3) excluding peptides for which experimental conditions are not explicitly given; (4) excluding peptides that were not tested in human or murine blood; (5) excluding peptides with complex modifications (e.g., polyethylene glycol modifications) because they are difficult to convert accurately to the standard SMILES format. After cleaning, a total of 635 samples remain.
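The five cleaning criteria above can be read as a single filter over candidate records. The following is a minimal sketch of that reading, not the authors' pipeline: the `PeptideRecord` layout, the field names, the example records, and the interpretation of "murine" as mouse or rat are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PeptideRecord:
    """Hypothetical record layout; field names are illustrative only."""
    name: str
    sequence: Optional[str]       # None when no explicit sequence is reported
    half_life_h: Optional[float]  # None when stability information is missing
    species: Optional[str]        # e.g. "human", "mouse", "rat"
    condition: Optional[str]      # e.g. "in vitro", "in vivo"; None if unreported
    complex_modification: bool = False  # e.g. PEGylation


# Assumption: "murine" is taken to mean mouse or rat.
ALLOWED_SPECIES = {"human", "mouse", "rat"}


def keep(record: PeptideRecord) -> bool:
    """Return True only if the record passes all five cleaning criteria."""
    if record.half_life_h is None:                        # (1) no stability data
        return False
    if not record.sequence:                               # (2) no explicit sequence
        return False
    if record.species is None or record.condition is None:  # (3) conditions missing
        return False
    if record.species.lower() not in ALLOWED_SPECIES:     # (4) not human/murine blood
        return False
    if record.complex_modification:                       # (5) hard to express as SMILES
        return False
    return True


def clean(records: List[PeptideRecord]) -> List[PeptideRecord]:
    return [r for r in records if keep(r)]


# Made-up records, one per rejection reason plus two that pass.
records = [
    PeptideRecord("pep-A", "GASV", 2.5, "human", "in vitro"),
    PeptideRecord("pep-B", "GASV", None, "human", "in vitro"),
    PeptideRecord("pep-C", None, 1.0, "mouse", "in vivo"),
    PeptideRecord("pep-D", "KLF", 0.5, "human", None),
    PeptideRecord("pep-E", "KLF", 0.5, "dog", "in vitro"),
    PeptideRecord("pep-F", "KLF", 12.0, "rat", "in vivo", complex_modification=True),
    PeptideRecord("pep-G", "DKF", 0.2, "mouse", "in vivo"),
]

print([r.name for r in clean(records)])
```

Keeping each criterion as a separate early return makes it straightforward to log which rule rejected a record, which is useful when auditing manually curated data.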
Since the FASTA format cannot accurately represent unnatural and modified residues, SMILES was used in this study as the one-dimensional representation of the dataset. As shown, for standard sequences, an automated conversion tool was developed to generate SMILES representations. For particularly complex or non-standard structures, the molecules were drawn manually in ChemDraw. All SMILES representations were subsequently standardized using RDKit.
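To illustrate what an automated sequence-to-SMILES conversion can look like for standard sequences, the sketch below builds a linear peptide SMILES string by concatenating per-residue backbone fragments. It is a hypothetical illustration rather than the authors' tool: the function names and the small side-chain table are assumptions, only a subset of residues with a single stereocenter is covered, and the output is not canonical (a toolkit such as RDKit would still be needed for standardization).

```python
# Side-chain SMILES fragments for an illustrative subset of natural residues.
# Residues with extra stereocenters or ring closures (T, I, P) are omitted on purpose.
SIDE_CHAINS = {
    "G": None,          # glycine has no side chain and no stereocenter
    "A": "C",
    "S": "CO",
    "V": "C(C)C",
    "L": "CC(C)C",
    "F": "Cc1ccccc1",
    "K": "CCCCN",
    "D": "CC(=O)O",
}


def residue_smiles(aa: str) -> str:
    """Backbone fragment N-CA(side chain)-C(=O) for one L-amino acid."""
    side = SIDE_CHAINS[aa]
    if side is None:
        return "NCC(=O)"
    # N[C@@H](R)C(=O) is the usual SMILES pattern for an L-alpha-amino acid.
    return f"N[C@@H]({side})C(=O)"


def linear_peptide_smiles(sequence: str) -> str:
    """Concatenate residue fragments N- to C-terminus and cap with the terminal OH."""
    unsupported = sorted(set(sequence) - set(SIDE_CHAINS))
    if unsupported:
        # Such cases would fall back to manual drawing in the workflow described above.
        raise ValueError(f"unsupported residues: {unsupported}")
    if not sequence:
        raise ValueError("empty sequence")
    return "".join(residue_smiles(aa) for aa in sequence) + "O"


print(linear_peptide_smiles("GA"))
print(linear_peptide_smiles("AFK"))
```

Because each fragment ends in a carbonyl carbon and the next begins with a nitrogen, plain string concatenation forms the amide bonds; cyclic or modified peptides need ring-closure digits or custom fragments and are outside this sketch.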
Traditionally, experimental methods such as X-ray crystallography, nuclear magnetic resonance spectroscopy, and electron microscopy have been effective ways to investigate peptide structures. With the advancement of technology, structure prediction models such as AlphaFold, RoseTTAFold, ESMFold, and HighFold can also provide plausible structures with high accuracy and efficiency. As displayed, we first search the PDB database; for peptides that cannot be retrieved, we adopt different strategies to predict their structures. For natural linear peptides, AlphaFold2 is used. For natural cyclic peptides, we use our proposed model, HighFold. For peptides with complex modifications, RDKit (version 2023.3.2) is used. Because these modified peptides include both linear and cyclic structures, we employ RDKit's ETKDGv3 algorithm for 3D structure generation; ETKDGv3 extends the applicability of previous ETKDG versions and performs reliably across small molecules, linear peptides, and cyclic peptides. For each input peptide, 5000 conformations are generated and optimized with the UFF force field, and only the lowest-energy conformations are selected for further experiments.
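The structure-acquisition logic above amounts to a small decision rule followed by an energy-based selection. The sketch below captures only that control flow under stated assumptions: the function names and boolean flags are hypothetical, the energies are placeholder numbers, and returning a single lowest-energy conformer is one possible reading of the selection step. It does not call AlphaFold2, HighFold, or RDKit.

```python
from typing import Sequence, Tuple


def choose_structure_source(in_pdb: bool, has_complex_modifications: bool,
                            is_cyclic: bool) -> str:
    """Pick the structure source for one peptide, mirroring the order described above."""
    if in_pdb:
        return "PDB"                    # an experimental structure is available
    if has_complex_modifications:
        return "RDKit ETKDGv3 + UFF"    # modified peptides, linear or cyclic
    if is_cyclic:
        return "HighFold"               # natural cyclic peptides
    return "AlphaFold2"                 # natural linear peptides


def lowest_energy_conformer(energies: Sequence[Tuple[int, float]]) -> int:
    """Return the conformer id with the lowest force-field energy.

    `energies` holds (conformer_id, energy) pairs, e.g. as collected after
    force-field optimization of an embedded conformer ensemble.
    """
    if not energies:
        raise ValueError("no conformers to choose from")
    return min(energies, key=lambda pair: pair[1])[0]


# Placeholder energies for three conformers (arbitrary units, not real data).
example_energies = [(0, 152.4), (1, 139.8), (2, 147.1)]
print(choose_structure_source(in_pdb=False, has_complex_modifications=True, is_cyclic=True))
print(lowest_energy_conformer(example_energies))
```

Checking the modification flag before the cyclic flag matters: a modified cyclic peptide is routed to conformer generation rather than to HighFold, consistent with the statement that the modified set contains both linear and cyclic structures.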
Multimodal learning integrates and processes heterogeneous types of data, which can improve both the accuracy and the generalizability of a model. In this study, we apply it to perform comprehensive feature engineering that takes physicochemical properties, sequence information, molecular structure, and 3D conformation into consideration. Specifically, (1) molecular descriptors, as the 0D feature, are processed by the Kolmogorov–Arnold Network (KAN) to capture the physical and chemical properties of peptides; (2) SMILES strings, as the 1D feature, are processed by the Transformer to capture sequential relationships; (3) the molecular structure, as the 2D feature, is processed by Graph Attention Networks (GAT) to learn the interactions of atoms and bonds; (4) the predicted 3D structure, as the 3D feature, is processed by the SE(3)-Transformer to provide additional spatial information. Subsequently, a series of learnable weights is employed to integrate these features, generating a joint feature vector. This vector is then fed into a shared layer for further dimensionality reduction. Notably, within this process, we explicitly encode the experimental conditions (the testing species and the in vivo/in vitro environment) as a two-dimensional binary vector, which is concatenated with the output of the main representation layers immediately before the final prediction layer. For example, [1, 0] denotes an