IDENTIFICATION OF HEMAGGLUTININ USING SEQUENCE INFORMATION MACHINE LEARNING LIGHTGBM
DOI: https://doi.org/10.62019/jj354693
Keywords: Feature engineering, Sequence analysis, DDE-PSSM, LightGBM, Model accuracy
Abstract
Introduction: Hemagglutinin (HA) contributes to viral infection by enabling the virus to fuse with the membrane of the host cell. Because HA is essential to influenza virus infection, pharmaceutical companies focus on developing drugs and vaccines that target it. Accurate identification of HA is therefore important for progress in vaccination and treatment. However, existing computational methods for identifying HA remain inadequate. The purpose of this study is to build a model that helps identify HA.
Methods: A benchmark dataset containing 106 HA and 106 non-HA sequences was obtained from UniProt. The samples were encoded using various sequence-derived features. We developed an ensemble classifier by stacking four machine learning (ML) methods and tuning the features.
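As an illustration of what a sequence-derived feature can look like, the following is a minimal sketch of plain dipeptide composition, which maps a protein sequence to a fixed-length numeric vector. It is an assumption chosen for simplicity, not the feature set of this study, and it is not the DDE-PSSM descriptor named in the keywords. The function name `dipeptide_composition` and the toy sequence are hypothetical.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered dipeptides built from them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies.

    Illustrative encoding only; the study's actual features may differ.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Toy sequence fragment (hypothetical, not taken from the benchmark dataset).
features = dipeptide_composition("MKTIIALSYIFCLVFA")
```

Each sequence, whatever its length, yields a vector of the same size, which is what lets sequences of different lengths be fed to a classifier.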
Results and discussion: In 5-fold cross-validation, the model achieved an accuracy of 97.80% and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.9930. On the test dataset, the accuracy was 93.18% and the AUC was 0.9793. Using DDE-PSSM features, the LightGBM algorithm achieved an accuracy of 97.13%, precision of 100.0%, sensitivity of 94.29%, specificity of 100.0%, MCC of 94.55% and AUC of 98.30%. The model's full cross-validation metrics (accuracy, precision, sensitivity, specificity, MCC and AUC) were 97.80%, 100.0%, 94.10%, 100.0%, 94.51% and 99.30%, respectively. These results indicate that the model is a strong predictor and can be a useful tool for biochemists studying HA.
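The threshold-based metrics reported above all follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not the study's data. AUC is omitted because it requires ranked prediction scores, not counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for illustration: 20 positives, 20 negatives, 2 missed.
metrics = binary_metrics(tp=18, fp=0, tn=20, fn=2)
```

With zero false positives, precision and specificity are both 1.0 while sensitivity stays below 1.0, the same pattern as in the reported results.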