A Conference Postcard from PSB

January 21, 2011 Cecy Marden Publishing

PLoS Computational Biology was delighted to receive a Conference Postcard from the Pacific Symposium on Biocomputing (PSB) earlier this week. Conference Postcards is an experiment in novel ways to communicate the highlights of a conference to those who were not able to attend. The first Postcard, by A. Murat Eren, a Ph.D. student at the University of New Orleans, is below. If you attended PSB and would like to submit a Conference Postcard this weekend, please read the guidelines. We look forward to hearing from you.

John Bunge on “Estimating the Number of Species with Catchall” in the Microbiome Studies session.

Microbial ecology, the relatively young and flourishing juncture of ecology and microbiology, had its own session at the Pacific Symposium on Biocomputing 2011 for the first time: “Microbiome Studies: Understanding how the dominant form of life affects us”. It hosted an introductory tutorial, paper presentations and a productive discussion session.

Microbes dominate life on this planet in terms of both abundance and diversity. Recent developments in massively parallel high-throughput sequencing technologies have made 16S ribosomal RNA gene tag-based relative abundance and phylogenetic studies feasible, and this, in turn, has helped scientists explore the diversity of microbial communities more deeply. With a better understanding of their dynamics, we will eventually be able to better explain the reciprocal and correlative interactions between these communities and their environments.

Nevertheless, assessing microbial diversity in a given environment is a cumbersome task, as it confronts researchers with a fundamental problem: sampling bias. Mostly due to the vast differences in scale involved in sampling, reliable and applicable ways to measure how well a sample represents a community’s true diversity are almost impossible to develop. However, microbial ecologists still have to rely on their samples to make inferences about the diversity of the original communities, and this requires heavy use of computational statistics and bioinformatics.

There are several widely used non-parametric and computationally lightweight diversity estimators that rely on abundance data, such as Chao1 and ACE, but they are known to give skewed results in very high diversity settings, where rare members create a long tail in the frequency count distribution of a sample. To meet one of the major needs of microbial ecology, biologists require more sophisticated yet still computationally efficient quantitative approaches that estimate the diversity of the originating community more accurately from long-tailed microbial samples.
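To make the idea of an abundance-based estimator concrete, here is a minimal Python sketch of the bias-corrected Chao1 estimate, computed from the number of singletons and doubletons in a sample. The function names and the example abundances are illustrative assumptions, not taken from any of the packages mentioned in this post.

```python
from collections import Counter


def frequency_counts(abundances):
    """Map j -> number of OTUs observed exactly j times (zeros are ignored)."""
    return Counter(a for a in abundances if a > 0)


def chao1(abundances):
    """Bias-corrected Chao1: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))."""
    freq = frequency_counts(abundances)
    s_obs = sum(freq.values())  # number of OTUs actually observed
    f1 = freq.get(1, 0)         # singletons: OTUs seen exactly once
    f2 = freq.get(2, 0)         # doubletons: OTUs seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical per-OTU read counts with a tail of rare members.
sample = [50, 20, 10, 5, 2, 2, 1, 1, 1, 1, 1, 1]
print(chao1(sample))  # 12 observed OTUs, 6 singletons, 2 doubletons -> 17.0
```

Because the estimate depends only on the singleton and doubleton counts, it is cheap to compute, but it also uses very little of the information in a long tail.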

That is why I believe the method and software package presented at PSB 2011 by John Bunge, “Estimating the Number of Species with CatchAll”, are an exciting step in the right direction.

CatchAll is a software package that aims to find the optimal finite mixture of models, with the best parameters, to realistically explain the distribution of operational taxonomic units in a sample, so that the actual diversity of the parent population can be computed by extrapolating from the fitted model. The result of an analysis with CatchAll is a list of recommended estimates, along with confidence intervals, goodness-of-fit statistics and standard errors, for researchers to investigate and select from. What makes CatchAll promising is that it is the first application to carry out parametric species richness estimation by efficiently combining statistics with heuristics, rather than relying on a single coverage-based non-parametric richness estimator for approximation.
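As a rough illustration of what parametric richness estimation means, the sketch below fits the simplest such model, a single zero-truncated Poisson, to the observed abundances and extrapolates the total number of OTUs, including those never sampled. This is not CatchAll's algorithm; it is a one-component toy under the assumption that every OTU is sampled at the same rate, and the function name and data are hypothetical.

```python
import math


def fit_truncated_poisson(abundances, tol=1e-10):
    """Fit a zero-truncated Poisson by maximum likelihood.

    Returns (lam, total), where total = S_obs / (1 - exp(-lam)) is the
    estimated richness including OTUs that were never observed.
    """
    counts = [a for a in abundances if a > 0]
    s_obs = len(counts)
    mean = sum(counts) / s_obs
    if mean <= 1.0:
        raise ValueError("all OTUs are singletons; the fit is undefined")
    # The MLE solves lam / (1 - exp(-lam)) = mean. The left-hand side is
    # increasing in lam, tends to 1 as lam -> 0 and exceeds mean at
    # lam = mean, so bisection on (0, mean] brackets the root.
    lo, hi = 0.0, mean
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean:  # -expm1(-x) == 1 - exp(-x)
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    total = s_obs / -math.expm1(-lam)
    return lam, total


# Hypothetical per-OTU read counts with a tail of rare members.
sample = [50, 20, 10, 5, 2, 2, 1, 1, 1, 1, 1, 1]
lam, total = fit_truncated_poisson(sample)
print(lam, total)
```

A single Poisson assumes all OTUs are equally abundant, so on long-tailed data like this it tends to fit poorly and to predict very few unseen OTUs. Mixture models address this by letting the sampling rate vary across OTUs.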

In his presentation, Bunge benchmarked the performance of CatchAll on a large data set from the International Census of Marine Microbes (ICoMM, http://icomm.mbl.edu) and demonstrated encouraging results. When I asked David Mark Welch from the Josephine Bay Paul Center at the Marine Biological Laboratory about CatchAll, he said it is already being used by people writing up ICoMM summaries and that it is going to be part of VAMPS (http://vamps.mbl.edu) very soon.

CatchAll can be downloaded from http://www.northeastern.edu/catchall/ and runs in all mainstream operating system environments. It is also part of MOTHUR (http://www.mothur.org/), and will soon be available within QIIME (http://qiime.sourceforge.net/).

Bunge closed his presentation by putting one of the possible future directions of microbial ecology research into words: How are we going to incorporate estimation of unseen diversity into analyses of the identities of organisms across populations? All statistical methods (including CatchAll) will, at best, let us study our samples and estimate how much diversity we are missing. However, as Bunge pointed out in his talk, estimating ‘how many more there are’ is only the first step. Estimating ‘who’ might be there and guessing ‘who’ might be missing from our samples would definitely be a game changer, and is a challenge to both statisticians and computer scientists.

The discussion session of Microbiome Studies took place after the paper presentations and was directed by James A. Foster from the University of Idaho. Microbial ecologists, including invited speaker Rob Knight from the University of Colorado, Jack Gilbert from Argonne National Laboratory and Thomas G. Doak from Indiana University, not only answered questions from scientists in other fields but also discussed and listed issues in microbial ecology that need attention.

During the discussion session, two major challenges in diversity assessment were highlighted: (1) errors introduced by the sequencing methods and (2) the difficulty of separating noise from the actual rare members of an underlying population. These hurdles mean that we obtain only approximations of the diversity in an environment rather than a factual representation of it. It was noted that even though CatchAll substantially improves the statistical robustness of the diversity estimation process, the caveats introduced by the limits of the 16S rRNA gene and of today’s high-throughput sequencing methods should always be considered.

One general suggestion that emerged from this discussion session was to focus on the functional role of the tail that represents rare individuals of microbial communities. It is intuitive to focus on dominant members of assemblages, but rare members might have an unanticipated impact on the functional diversity of their communities.

Another consensus emerging from the discussion session was the importance of defining higher-order interactions of microbial populations with their human hosts. Vast amounts of sequence data and metadata are available in accessible online repositories. This allows researchers to develop and test hypotheses on minimum core sets of microbes that define diseases. Modeling the compositional complexity of microbial populations will definitely demand a serious amount of effort and time. Nevertheless, acknowledging this necessity and getting computer scientists and statisticians interested in solving this puzzle might be the first step.

All conference material and an electronic copy of the proceedings can be obtained from http://psb.stanford.edu/.

A. Murat Eren
Department of Computer Science
University of New Orleans