IDENTIFICATION OF HEMAGGLUTININ USING SEQUENCE INFORMATION MACHINE LEARNING LIGHTGBM

Rahu Sikander; Narmeen Zakaria Bawany; Ali Ghulam; Mujeebu Rehman; Rasulov Farruhbek

doi:10.62019/jj354693

Authors

Rahu Sikander, Department of Computer Science & Software Engineering, Jinnah University for Women, Sindh, Pakistan
Narmeen Zakaria Bawany, Department of Computer Science & Software Engineering, Jinnah University for Women, Sindh, Pakistan
Ali Ghulam, Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
Mujeebu Rehman, School of Information and Communication Engineering, Guilin University of Electronic Technology, Guilin, China
Rasulov Farruhbek, Department of Pharmaceutical Sciences, Andijan State Medical Institute, Andijan, Uzbekistan

DOI:

https://doi.org/10.62019/jj354693

Keywords:

Feature engineering, Sequence analysis, DDE-PSSM, LightGBM, Model accuracy

Abstract

Introduction: Hemagglutinin (HA) contributes to viral infection by enabling the virus to fuse with the host cell membrane. Because HA is essential to influenza virus infection, pharmaceutical companies focus on developing drugs and vaccines that target it. Accurate identification of HA is therefore important for progress in vaccine and treatment development. However, existing computational methods for identifying HA remain inadequate. This study aims to build a model that helps identify HA.

Methods: A benchmark dataset of 106 HA and 106 non-HA sequences was obtained from UniProt. Each sample was encoded using various sequence-derived features. We developed an ensemble classifier by stacking four machine learning (ML) methods and tuning the features.

Results and discussion: In 5-fold cross-validation, the model achieved an accuracy of 97.80% and an area under the receiver operating characteristic (ROC) curve (AUC) of 0.9930. On the test dataset, its accuracy was 93.18% and its AUC was 0.9793. Using DDE-PSSM features, the LightGBM algorithm achieved an accuracy of 97.13%, precision of 100.0%, sensitivity of 94.29%, specificity of 100.0%, Matthews correlation coefficient (MCC) of 94.55% and AUC of 98.30%. The accuracy, precision, sensitivity, specificity, MCC and AUC values for using anti-hypertensives were 97.80%, 100.0%, 94.10%, 100.0%, 94.51% and 99.30%, respectively. These results indicate that the model is a strong predictor and a useful tool for biochemists studying HA.
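
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how accuracy, precision, sensitivity, specificity and MCC are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the study, and the authors' own evaluation code may differ.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: HA sequences predicted as HA / non-HA.
    tn/fp: non-HA sequences predicted as non-HA / HA.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard each ratio against a zero denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient, a value in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts (NOT from the paper): no false positives, five missed positives.
m = binary_metrics(tp=45, fp=0, tn=50, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With zero false positives, precision and specificity both equal 1.0 while sensitivity is lower, which is the same pattern as the figures reported above. Note that MCC is defined on the range -1 to 1, so a value quoted as a percentage corresponds to the coefficient multiplied by 100.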
