Accurate Phylogenetic Classification of DNA Fragments Based on Sequence Composition

Metagenome studies have retrieved vast amounts of sequence from a variety of environments, leading to novel discoveries and great insights into the uncultured microbial world. Except for very simple communities, diversity makes sequence assembly and analysis a very challenging problem. To understand the structure and function of microbial communities, a taxonomic characterization of the obtained sequence fragments is highly desirable, yet it is currently limited mostly to those sequences that contain phylogenetic marker genes. We show that for clades at the rank of domain down to genus, sequence composition allows very accurate phylogenetic characterization of genomic sequence. We developed a composition-based classifier, PhyloPythia, for de novo phylogenetic sequence characterization and have trained it on a data set of 340 genomes. Through extensive evaluation experiments we show that the method is accurate across all taxonomic ranks considered, even for sequences that originate from novel organisms and are as short as 1 kb. Application to two metagenome datasets obtained from samples of phosphorus-removing sludge showed that the method allows the accurate classification at genus level of most sequence fragments from the dominant populations, while at the same time correctly characterizing even parts of the samples at higher taxonomic levels.
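As an illustration of what "sequence composition" can mean in practice, the following minimal sketch computes normalized k-mer (oligonucleotide) frequencies for a DNA fragment. This is an assumption made for illustration only: the abstract does not state which composition features or which learning method PhyloPythia uses, and the function name `kmer_frequencies` and the default `k=4` are hypothetical choices.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every DNA k-mer in `sequence`.

    Illustrative sketch only: windows containing characters other than
    A, C, G, T (for example N) are ignored. This is not claimed to be
    the feature set used by PhyloPythia.
    """
    seq = sequence.upper()
    # Enumerate all 4**k possible k-mers so the feature vector has a fixed length.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A, C, G, T contribute to the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return {km: 0.0 for km in kmers}
    return {km: counts[km] / total for km in kmers}


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGTTTGACA", k=2)
    # Print the three most frequent dinucleotides of the toy fragment.
    for kmer, value in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
        print(kmer, round(value, 3))
```

A fixed-length vector of this kind could then be passed to any standard supervised classifier trained on fragments of known taxonomic origin; which classifier is appropriate is left open here.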

By: Alice C. McHardy; Hector Garcia Martin; Aristotelis Tsirigos; Philip Hugenholtz; Isidore Rigoutsos

Published in: RC23930 in 2006


