Unsupervised ensemble learning for genome sequencing

Authors:

Highlights:

• The variant calling step in next-generation sequencing technologies is formulated as a classification problem.

• An unsupervised ensemble classification method is proposed as a variant caller for DNA sequencing.

• An EM-based variant calling algorithm is presented that makes each decision by estimating the maximum a posteriori class.

• The number of classes to be decided is greater than the number of distinct labels observed.

• Experimental results with real human DNA sequencing data support the approach.
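The EM-based ensemble idea in the highlights can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a generic Dawid-Skene-style EM for combining unlabeled classifier votes, and it assumes the simpler case where the number of classes equals the number of observed labels (the paper addresses the harder case with more classes than labels). The function name `em_ensemble`, its parameters, and the smoothing constant are all hypothetical choices made for illustration.

```python
import math


def em_ensemble(votes, n_classes, n_iter=50, smooth=1e-2):
    """Combine classifier votes without ground truth (illustrative sketch).

    votes[i][j] is the label in 0..n_classes-1 that classifier j gave item i.
    Returns (MAP decisions, posterior class probabilities per item).
    """
    n_items, n_clf = len(votes), len(votes[0])

    # Initialise soft class memberships from vote fractions (soft majority vote).
    post = [[sum(1 for v in row if v == k) / n_clf for k in range(n_classes)]
            for row in votes]

    for _ in range(n_iter):
        # M-step: re-estimate class priors (with additive smoothing).
        prior = [(sum(p[k] for p in post) + smooth) / (n_items + smooth * n_classes)
                 for k in range(n_classes)]

        # M-step: one confusion matrix per classifier,
        # conf[j][k][l] = P(classifier j outputs l | true class k).
        conf = []
        for j in range(n_clf):
            mat = [[smooth] * n_classes for _ in range(n_classes)]
            for row, p in zip(votes, post):
                for k in range(n_classes):
                    mat[k][row[j]] += p[k]
            conf.append([[c / sum(r) for c in r] for r in mat])

        # E-step: posterior over the true class of each item, in the log domain.
        new_post = []
        for row in votes:
            logp = [math.log(prior[k])
                    + sum(math.log(conf[j][k][row[j]]) for j in range(n_clf))
                    for k in range(n_classes)]
            top = max(logp)
            weights = [math.exp(x - top) for x in logp]
            total = sum(weights)
            new_post.append([w / total for w in weights])
        post = new_post

    # Decision: maximum a posteriori class for each item.
    decisions = [max(range(n_classes), key=lambda k: p[k]) for p in post]
    return decisions, post
```

In a variant-calling setting, each item would correspond to a candidate genomic site and each classifier to an individual caller, but that mapping is an assumption of this sketch rather than a detail taken from the source.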

Keywords: Expectation maximization algorithm, Variant calling, Genome sequencing, Unsupervised multi-class ensemble classifier, GATK

Review history: Received 8 August 2021, Revised 1 April 2022, Accepted 18 April 2022, Available online 19 April 2022, Version of Record 6 May 2022.

Paper URL: https://doi.org/10.1016/j.patcog.2022.108721