Improved prediction of anti-angiogenic peptides based on machine learning models and comprehensive features from peptide sequences

Amino acid and dipeptide composition analyses

The AAC analysis of AAPs and non-AAPs for S212 is illustrated in Fig. S1. The analysis reveals that the residues C, P, R, S, and W are predominantly present in AAPs, whereas the residues A, E, I, L, and V are more commonly present in non-AAPs. The DPC analysis for S212 is shown in Fig. S2. Certain dipeptides such as CG, CN, CS, HG, HH, SP, and SC are predominant in AAPs, whereas dipeptides such as AA, EL, EV, IA, and NK are more common in non-AAPs. The analyses show that the presence of certain amino acids and their synergistic interactions play a pivotal role in modulating the angiogenic properties of peptides.

Selected feature subsets

Cross validation results of S212 and NT-S160 based on different feature numbers were evaluated with MCC, and the results are shown in Fig. 2. It can be seen that the best feature numbers for S212 and NT-S160 are 150 and 120, respectively. Using the two selected feature numbers, the best MCCs achieved by the ML models are 0.683 and 0.651 for S212 and NT-S160, respectively. The complete lists of the 150 features for S212 and the 120 features for NT-S160 are shown in Supplementary Tables S2 and S3, respectively. The Gini importance produced by the random forest model is also listed for comparison. The features of S212 and NT-S160 are associated with 34 and 32 feature types, respectively, among which 28 feature types are in common. Table 1 lists the top 5 most frequent feature types for S212 and NT-S160. All of them are among the 28 common feature types, despite their differences in the number of selected features for each feature type. The complete lists of feature types for S212 and NT-S160 are shown in Table S4.

Figure 2 MCCs of the best performing ML models based on different feature numbers on (A) S212 and (B) NT-S160. The highest MCC in each panel is circled in red.

Table 1 The selected numbers and sizes of the top 5 most frequent feature types from the selected feature subset using S212 and NT-S160.

Feature exploration and biological relevance

The selected feature subset of S212 (Supplementary Table S2) includes features related to certain amino acids, including Ala, Cys, Ser, Trp, Leu, and Phe. Examples are amino acid composition (AAC) for Ala, Cys, and Ser; distance distribution of residues [43] (DDR) for Ala, Cys, and Trp; and Shannon entropy at residue level (SER) for Ala, Cys, and Ser. This is in good agreement with the propensities shown in Supplementary Fig. S1, namely, Cys, Ser, and Trp are prevalent in AAPs, whereas Ala, Leu, and Phe are more common in non-AAPs. There is also a strong association between hydrophobicity and the majority of the selected CTD features. This is consistent with the established fact that AAPs have a relatively high occurrence of hydrophobic residues [44]. The presence of aliphatic residues appears to be a significant characteristic in CKSAAGP and GDPC. This observation aligns with the fact that aliphatic amino acids such as Ala, Ile, Leu, and Val tend to occur more frequently in non-AAPs, as indicated in Supplementary Fig. S1. In addition, the feature subset also contains generalized feature types, such as Ez [45], Z3 [46], and Z5 [47]. Ez characterizes the empirical residue-based potential for protein insertion in lipid membranes, a process governed by complex factors such as hydrophobic interactions, electrostatic forces, and hydrogen bonding. Meanwhile, Z3 and Z5 are multidimensional descriptors capturing hydrophobicity, charge, aromaticity, polarity, and other physicochemical properties essential to peptide behavior and function.

It can be observed from Supplementary Table S3 that, compared to S212, NT-S160 shows a significantly smaller number of selected features across AAC, GDPC, and CKSAAGP, with only 1, 1, and 4 features selected, respectively, in contrast to 3, 4, and 19 features selected for S212. These feature types, which are derived from specific amino acids or their combinations, are likely influenced by the shorter sequence lengths present in NT-S160. Supplementary Table S3 shows that hydrophobicity remains a predominant attribute among the selected features of the CTD descriptor for NT-S160. Additionally, the feature types Ez, Z3, and Z5, which capture various physicochemical properties, still contribute significantly to the selected features, accounting for 14, 9, and 21 features, respectively.

Benchmark results of cross validation

Table 2 shows the benchmark results of cross validation using the six different ML models on S212. It can be seen that SVM, achieving an MCC of 0.642, outperforms the other models in all the evaluation measures. SVM outperforms CB, the second-best model, by 8.7%, 3.7%, and 4.2% in MCC, AUC, and accuracy, respectively. The precision of SVM is 0.840, an improvement over the other models of 3.6% to 9.9%. On the other hand, SVM generates a recall of 0.825, which is higher than the other methods by a significant margin ranging from 7.6% to 11.5%, suggesting that recall plays a more significant role than precision in explaining SVM's substantial improvement in MCC on S212. Table 3 shows the benchmark results of cross validation on NT-S160. SVM achieves an MCC of 0.598 and generates the best value in all evaluation measures. The MCCs of ET, RF, and CB are above 0.5. Similarly, SVM outperforms the other models by 5.0% to 16.7% in recall and by 2.9% to 9.6% in precision, suggesting that the improved recall (or sensitivity) is the main reason why SVM yields the best MCC on NT-S160. The ROC curves of the six ML models for S212 and NT-S160 are illustrated in Supplementary Fig. S3A,B, respectively.
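
The evaluation measures compared in the cross validation benchmark can all be derived from confusion-matrix counts. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics` and the counts passed to it are hypothetical and are not taken from Tables 2 or 3. It uses the standard definitions of accuracy, precision, recall, and MCC, and illustrates how a drop in recall alone lowers the MCC.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC uses all four cells; it is defined as 0 here when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}


# Hypothetical counts for illustration only (not from the paper's tables).
balanced = binary_metrics(tp=80, fp=20, tn=80, fn=20)
# Same false positives, but 20 more positives are missed (lower recall).
low_recall = binary_metrics(tp=60, fp=20, tn=80, fn=40)

print(balanced)
print(low_recall)
```

With these illustrative counts, the first case gives a recall of 0.8 and an MCC of 0.6, while the second gives a recall of 0.6 and an MCC of about 0.41, even though the number of false positives is unchanged.
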

Table 2 Benchmark results of cross validation on S212.

Table 3 Benchmark results of cross validation on NT-S160.

Benchmark results of independent tests

Table 4 shows the benchmark results on S56 for the six ML models and three existing predictors, AAPred-CNN, TargetAntiAngio, and AntiAngioPred. It can be observed that SVM yields the best recall, accuracy, and MCC. SVM improves on the MCC and recall of AAPred-CNN, the state-of-the-art method, by 5.3% and 17.8%, respectively, though AAPred-CNN generates the best precision of 0.815. The AUC of SVM, 0.828, is comparable to the best AUC of 0.830 produced by TargetAntiAngio. It can also be seen that AntiAngioPred yields the lowest MCC and recall, indicating a large number of false negatives. This is very likely due to the fact that AntiAngioPred relies on a single feature type, amino acid composition, for prediction. The ROC curves of the six ML models for S56 are illustrated in Supplementary Fig. S4A.
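
For readers comparing the AUC values above, the AUC can be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by direct pairwise comparison. It is an illustration under stated assumptions: the function name `roc_auc` and the labels and scores are hypothetical, and this is not the procedure used to produce Table 4.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = AAP, 0 = non-AAP) and prediction probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

In this toy example, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. The pairwise form is quadratic in the number of sequences, which is acceptable for datasets of this size.
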

Table 4 Benchmark results of independent test on S56.

The evaluation results on NT-S40 are shown in Table 5. Consistent with previous studies, the overall MCCs for all methods are improved compared to the results on S56, the dataset of whole peptide sequences. SVM generates an MCC of 0.756, which is 5.9%, 19.6%, and 24.6% higher than AAPred-CNN, TargetAntiAngio, and AntiAngioPred, respectively. Notably, LightGBM, RF, ET, and CB also outperform the three existing methods in MCC. AAPred-CNN produces a precision and specificity of 1.000 but suffers from the second lowest recall of 0.650, indicating a large number of false negatives. In contrast, TargetAntiAngio produces the best recall of 0.905 but suffers from the lowest precision among all the methods, indicating a large number of false positives. Overall, the above benchmark comparisons demonstrate that our SVM model produces a significant improvement in MCC over existing methods. The ROC curves of the six ML models for NT-S40 are illustrated in Supplementary Fig. S4B.

Table 5 Benchmark results of independent test on NT-S40.

We further analyzed the correlation between the prediction probability output by each model and the true positive rate (TPR), calculated as the number of actual AAPs divided by the total number of sequences predicted within a given range of prediction probability. As illustrated in Fig. 3A,B, all six ML models demonstrate strong positive correlations between the true positive rate and the prediction probability on both datasets. In other words, a higher prediction probability leads to a higher chance that the sequence is an actual AAP.

Figure 3 True positive rate (TPR) and sequence number (denoted as SeqNum in the figure) with respect to the prediction probability of the 6 ML models on (A) S56 and (B) NT-S40. The prediction probability for each sequence is obtained from the output of each machine learning model. The true positive rate is calculated as the number of AAPs divided by the total number of sequences predicted within the range of the prediction probability.

Prediction accuracy with respect to peptide properties

Prediction results from the independent tests based on SVM, the most accurate model, were further analyzed with respect to three different peptide properties, namely, the ratios of hydrophobic, hydrophilic, and charged residues within a peptide. In this study, hydrophobic amino acids are V, I, L, M, F, W, and C; hydrophilic amino acids are R, N, D, E, Q, H, K, S, and T; charged amino acids are E, D, R, K, and H. As illustrated in Fig. 4, the prediction accuracy is positively correlated with the ratio of hydrophobic residues within a peptide for S56 (Fig. 4A) and NT-S40 (Fig. 4D). On the other hand, the prediction accuracy is negatively correlated with the ratio of hydrophilic residues in a peptide (Fig. 4B for S56 and Fig. 4E for NT-S40) and the ratio of charged residues in a peptide (Fig. 4C for S56 and Fig. 4F for NT-S40). These results suggest that peptides with more hydrophobic residues and fewer hydrophilic and charged residues are more accurately predicted. This is in agreement with prior studies, which state that the hydrophobicity of a peptide is an important characteristic of AAPs [16,44]. The analyses point toward potential areas for future refinement, especially in improving predictions for peptides with fewer hydrophobic residues and a higher proportion of hydrophilic and charged residues.

Figure 4 Analyses of prediction accuracy versus different properties of peptides from the independent tests. Curves in panels A, B, and C represent the mean accuracy for peptides from S56 with different ratios of hydrophobic, hydrophilic, and charged residues, respectively. Curves in panels D, E, and F are defined analogously for peptides from NT-S40. Bars in each panel represent the number of peptides within the ratios specified by the x-axis.

Efficacy of the selected feature subset

In this study, each sequence is initially encoded with various compositional, physicochemical, and biological features, leading to a feature vector of 4335 numeric values. The selected feature subsets, consisting of 150 and 120 numeric values for S212 and NT-S160, respectively, play an important role in the enhanced prediction accuracy. To validate the efficacy of the feature subsets for the discrimination of AAPs, we applied t-distributed stochastic neighbor embedding (t-SNE) [48,49] to visualize data distributions on a two-dimensional plane. As illustrated in Fig. 5, the t-SNE distributions of negatives and positives overlap seriously when all 4335 numeric features are used for S212 (Fig. 5A) and NT-S160 (Fig. 5C). Conversely, the t-SNE distributions of negatives and positives are more separated when the selected feature subsets are used for S212 (Fig. 5B) and NT-S160 (Fig. 5D). Taking Fig. 5B as an example, the positives are more concentrated in the upper left and upper right regions, whereas the negatives are more concentrated in the lower left region. This shows that the feature subsets for the two datasets are informative and beneficial to the improved prediction performance of the ML models.

Figure 5 t-SNE distributions of (A) S212 using 4335 numeric features, (B) S212 using the 150 selected features, (C) NT-S160 using 4335 numeric features, and (D) NT-S160 using the 120 selected features. Negatives refer to non-AAPs and positives refer to AAPs.
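
The two post hoc analyses described earlier (TPR per prediction-probability range, and accuracy versus residue-class ratios) rest on simple counting. The sketch below shows one way to compute both quantities. It is an assumption-laden illustration rather than the authors' code: the function names, the number of probability bins, the example peptide, and the example probabilities are all hypothetical. Only the three residue sets and the TPR definition come from the text.

```python
# Residue classes as defined in the text of this study.
HYDROPHOBIC = set("VILMFWC")
HYDROPHILIC = set("RNDEQHKST")
CHARGED = set("EDRKH")


def residue_ratios(sequence):
    """Fraction of hydrophobic, hydrophilic, and charged residues in a peptide."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    return {
        "hydrophobic": sum(aa in HYDROPHOBIC for aa in seq) / n,
        "hydrophilic": sum(aa in HYDROPHILIC for aa in seq) / n,
        "charged": sum(aa in CHARGED for aa in seq) / n,
    }


def tpr_by_probability_bin(probabilities, labels, n_bins=5):
    """For each equal-width probability range, return (SeqNum, TPR).

    TPR follows the text: actual AAPs (label 1) divided by all sequences
    whose prediction probability falls in that range. Empty ranges give None.
    """
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, y in zip(probabilities, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[idx] += 1
        positives[idx] += y
    return [(c, pos / c if c else None) for c, pos in zip(counts, positives)]


# Hypothetical peptide and predictions, for illustration only.
ratios = residue_ratios("CWLKRS")
bins = tpr_by_probability_bin(
    probabilities=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95],
    labels=[0, 0, 1, 1, 1, 1],
    n_bins=5,
)
print(ratios)
print(bins)
```

The number and width of the probability ranges used for Fig. 3 are not stated in this section, so `n_bins=5` is an arbitrary choice. Note also that the hydrophilic and charged sets overlap (E, D, R, K, H), so the three ratios need not sum to one.
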

https://www.nature.com/articles/s41598-024-65062-9
