DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools [METHODS]

Anupama Jha1,8, Stephanie C. Bohaczuk2,8, Yizi Mao2, Jane Ranchalis2, Benjamin J. Mallory1, Alan T. Min3, Morgan O. Hamm1, Elliott Swanson1, Danilo Dubocanin4, Connor Finkbeiner1, Tony Li1, Dale Whittington5, William Stafford Noble1,6, Andrew B. Stergachis1,2,7 and Mitchell R. Vollger2

1 Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
2 Division of Medical Genetics, University of Washington, Seattle, Washington 98195, USA
3 Department of Statistics, University of Washington, Seattle, Washington 98195, USA
4 Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
5 Department of Medicinal Chemistry, University of Washington, Seattle, Washington 98195, USA
6 Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
7 Brotman Baty Institute for Precision Medicine, Seattle, Washington 98195, USA

8 These authors contributed equally to this work.

Corresponding authors: absterga@uw.edu, mvollger@uw.edu

Abstract

Long-read DNA sequencing has recently emerged as a powerful tool for studying both genetic and epigenetic architectures at single-molecule and single-nucleotide resolution. Long-read epigenetic studies encompass both the direct identification of native cytosine methylation and the identification of exogenously placed DNA N6-methyladenine (DNA-m6A). However, detecting DNA-m6A modifications using single-molecule sequencing, as well as coprocessing single-molecule genetic and epigenetic architectures, is limited by computational demands and a lack of supporting tools. Here, we introduce fibertools, a state-of-the-art toolkit that features a semisupervised convolutional neural network for fast and accurate identification of m6A-marked bases using Pacific Biosciences (PacBio) single-molecule long-read sequencing, as well as the coprocessing of long-read genetic and epigenetic data produced using either the PacBio or Oxford Nanopore Technologies (ONT) sequencing platforms. We demonstrate accurate DNA-m6A identification (>90% precision and recall) along >20 kb long DNA molecules with an ∼1000-fold improvement in speed. In addition, we demonstrate that fibertools can readily integrate genetic and epigenetic data at single-molecule resolution, including the seamless conversion between molecular and reference coordinate systems, allowing for accurate genetic and epigenetic analyses of long-read data within structurally and somatically variable genomic regions.

Received February 9, 2024. Accepted May 21, 2024.
