Please use this identifier to cite or link to this item: https://doi.org/10.1186/s12859-021-04084-w
Title: A multi-task CNN learning model for taxonomic assignment of human viruses
Authors: Ma, Haoran; Tan, Tin Wee; Ban, Kenneth Hon Kim
Keywords: Convolutional neural network; Deep learning; Genomic coverage; Naïve Bayesian network; Taxonomic assignment
Issue Date: 1-Jun-2021
Publisher: BioMed Central Ltd
Citation: Ma, Haoran, Tan, Tin Wee, Ban, Kenneth Hon Kim (2021-06-01). A multi-task CNN learning model for taxonomic assignment of human viruses. BMC Bioinformatics 22 : 194. ScholarBank@NUS Repository. https://doi.org/10.1186/s12859-021-04084-w
Rights: Attribution 4.0 International
Abstract: Background: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the reference sequences. Furthermore, many tools may not incorporate the genomic coverage of assigned reads as part of overall likelihood of a correct taxonomic assignment for a sample. Results: In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively compared with Bowtie 2 and the region assignments were used for estimation of genomic coverage that was incorporated into a naïve Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data. Conclusions: We have developed a pipeline that combines a novel MT-CNN model that is able to identify viruses with divergent sequences together with assignment of the genomic region, with a Bayesian approach to ranking of taxonomic assignments by taking into account both the number of assigned reads and genomic coverage. The pipeline is available at GitHub via https://github.com/MaHaoran627/CNN_Virus. © 2021, The Author(s).
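Illustrative note: the abstract describes ranking candidate viruses by combining the proportion of assigned reads with the genomic coverage estimated from region assignments. The following is a minimal sketch of that idea only, not the authors' implementation (see the CNN_Virus repository for that). It assumes each read has already been given a taxon label and a genomic region index, treats the two pieces of evidence as independent, and multiplies them. The function name `rank_candidates`, the `n_regions` parameter, and the scoring formula are all hypothetical.

```python
from collections import Counter, defaultdict


def rank_candidates(assignments, n_regions=10):
    """Rank taxa from per-read (taxon, region_index) assignments.

    Hypothetical scoring: score = read proportion * fraction of regions hit,
    normalised so the scores sum to 1. This is an assumed simplification,
    not the Bayesian network described in the paper.
    """
    reads_per_taxon = Counter()
    regions_per_taxon = defaultdict(set)
    for taxon, region in assignments:
        reads_per_taxon[taxon] += 1
        regions_per_taxon[taxon].add(region)

    total_reads = sum(reads_per_taxon.values())
    if total_reads == 0:
        return []

    scores = {}
    for taxon, n_reads in reads_per_taxon.items():
        proportion = n_reads / total_reads
        coverage = len(regions_per_taxon[taxon]) / n_regions
        scores[taxon] = proportion * coverage

    norm = sum(scores.values())
    # Highest normalised score first.
    return sorted(
        ((taxon, score / norm) for taxon, score in scores.items()),
        key=lambda item: item[1],
        reverse=True,
    )


if __name__ == "__main__":
    # Toy data: taxon "B" has more reads, but they all fall in one region,
    # so taxon "A" (fewer reads, broader coverage) ranks first.
    reads = [("A", 0), ("A", 1), ("A", 2), ("B", 0), ("B", 0), ("B", 0), ("B", 0)]
    for taxon, score in rank_candidates(reads, n_regions=4):
        print(taxon, round(score, 3))
```

The toy data shows why coverage matters in such a scheme: a candidate whose reads pile up in a single region is down-weighted relative to one whose reads are spread across the genome.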
Source Title: BMC Bioinformatics
URI: https://scholarbank.nus.edu.sg/handle/10635/232076
ISSN: 1471-2105
DOI: 10.1186/s12859-021-04084-w
Appears in Collections: Staff Publications
Files in This Item:
File: 10_1186_s12859-021-04084-w.pdf
Size: 1.58 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None

This item is licensed under a Creative Commons Attribution 4.0 International License.