IDENTIFICATION OF HEMAGGLUTININ USING SEQUENCE INFORMATION MACHINE LEARNING LIGHTGBM

Authors

  • Rahu Sikander Department of Computer Science & Software Engineering Jinnah University for Women, Sindh, Pakistan Author
  • Narmeen Zakaria Bawany Department of Computer Science & Software Engineering Jinnah University for Women, Sindh, Pakistan Author
  • Ali Ghulam Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan Author
  • Mujeebu Rehman School of Information and Communication Engineering, Guilin University of Electronic Technology, Guilin, China Author
  • Rasulov Farruhbek Department of Pharmaceutical Sciences, Andijan State Medical Institute, Andijan, Uzbekistan Author

DOI:

https://doi.org/10.62019/jj354693

Keywords:

Feature engineering, Sequence analysis, DDE-PSSM, LightGBM, Model accuracy

Abstract

Introduction: Hemagglutinin (HA) contributes to viral infection by enabling the virus to fuse with the host cell membrane. Because HA is essential to influenza virus infection, pharmaceutical companies focus on developing drugs and vaccines that target it. Accurate identification of HA is therefore important for progress in vaccination and treatment. However, existing computational methods for identifying HA remain inadequate. This study aims to build a model that helps identify HA.

Methods: A benchmark dataset containing 106 HA and 106 non-HA sequences was obtained from UniProt. Each sample was encoded using various sequence-derived properties. We developed an ensemble classifier by stacking four machine learning (ML) methods and tuning the features.
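To illustrate what a sequence-derived encoding can look like, the sketch below computes dipeptide deviation from expected mean (DDE) features, the "DDE" part of the DDE-PSSM keyword. The abstract does not say how the features were computed, so this follows the commonly cited DDE definition and is an assumption, not the authors' pipeline. The names `dde_features`, `CODONS` and `DIPEPTIDES` are hypothetical, and the PSSM component is not covered.

```python
import math
from itertools import product

# Number of codons encoding each amino acid in the standard genetic code
# (61 sense codons in total).
CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}
AMINO_ACIDS = sorted(CODONS)
# All 400 ordered dipeptides, in a fixed order so every vector is aligned.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dde_features(sequence):
    """Return a 400-dimensional DDE vector for one protein sequence."""
    # Simplification (assumption): non-standard residues are dropped
    # before dipeptides are counted.
    residues = [aa for aa in sequence.upper() if aa in CODONS]
    n_pairs = len(residues) - 1
    if n_pairs < 1:
        raise ValueError("sequence needs at least two standard residues")

    counts = dict.fromkeys(DIPEPTIDES, 0)
    for first, second in zip(residues, residues[1:]):
        counts[first + second] += 1

    features = []
    for dipeptide in DIPEPTIDES:
        observed = counts[dipeptide] / n_pairs          # dipeptide composition
        expected = (CODONS[dipeptide[0]] / 61) * (CODONS[dipeptide[1]] / 61)
        variance = expected * (1 - expected) / n_pairs  # theoretical variance
        features.append((observed - expected) / math.sqrt(variance))
    return features


if __name__ == "__main__":
    vector = dde_features("MKTIIALSYIFCLVFA")  # arbitrary example fragment
    print(len(vector))
```

Each sequence maps to a fixed-length vector regardless of its length, which is what allows sequences of different sizes to be passed to a tabular learner such as LightGBM.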

Results and discussion: In 5-fold cross-validation, the model achieved an accuracy of 97.80% and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.9930. On the test dataset, the accuracy was 93.18% and the AUC was 0.9793. Using DDE-PSSM features, the LightGBM algorithm achieved an accuracy of 97.13%, precision of 100.0%, sensitivity of 94.29%, specificity of 100.0%, Matthews correlation coefficient (MCC) of 94.55% and AUC of 98.30%. The accuracy, precision, sensitivity, specificity, MCC and AUC values for using anti-hypertensives were 97.80%, 100.0%, 94.10%, 100.0%, 94.51% and 99.30%, respectively. These results indicate that the model is a strong predictor and could be useful to biochemists studying HA.
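The threshold-based metrics reported above all derive from the four cells of a binary confusion matrix. The following minimal sketch shows the standard formulas. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # MCC denominator: product of the four marginal totals.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # recall
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # By convention MCC is set to 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    for name, value in binary_metrics(tp=45, fp=2, tn=48, fn=5).items():
        print(f"{name}: {value:.4f}")
```

MCC computed this way lies in the range -1 to 1, so a figure such as 94.55% corresponds to 0.9455 on that scale. AUC is not included because it requires ranked prediction scores rather than a single confusion matrix.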

Published

2025-07-06

How to Cite

IDENTIFICATION OF HEMAGGLUTININ USING SEQUENCE INFORMATION MACHINE LEARNING LIGHTGBM. (2025). Journal of Medical & Health Sciences Review, 2(3). https://doi.org/10.62019/jj354693
