One-shot Learning of Poisson Distributions - Information Theory of Audic-Claverie Statistic for Analysing cDNA Arrays

If you have a question about this talk, please contact Per Kristian Lehre.

It is of utmost importance for biologists to be able to analyse patterns of expression levels of selected genes in different tissues possibly obtained under different conditions or treatment regimes. Even subtle changes in gene expression levels can be indicators of biologically crucial processes such as cell differentiation and cell specialisation. Measurement of gene expression levels can be performed either via hybridisation to microarrays, or by counting gene tags (signatures) using e.g. Serial Analysis of Gene Expression (SAGE) or Massively Parallel Signature Sequencing (MPSS) methodologies.

The SAGE procedure results in a library of short sequence tags, each representing an expressed gene. The key assumption is that every mRNA copy in the tissue has the same chance of ending up as a tag in the library. Selecting a specific tag from the pool of transcripts can be approximately considered as sampling with replacement. The key step in many SAGE studies is the identification of 'interesting' genes, typically those that are differentially expressed under different conditions or treatments. This is done by comparing the number of specific tags found in the two SAGE libraries corresponding to the different conditions or treatments.

Audic and Claverie were among the first to systematically study the influence of random fluctuations and sample size on the reliability of digital expression profile data. For a transcript representing a small fraction of the library and a large number N of clones, the probability of observing x tags of the same gene is well approximated by the Poisson distribution parametrised by its mean (and variance) m>0, where the unknown parameter m signifies the number of transcripts of the given type (tag) per N clones in the cDNA library.
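The Poisson approximation can be illustrated with a small numerical comparison. The sketch below treats tag sampling as N independent draws with a fixed per-clone probability (a binomial model) and compares it with a Poisson distribution of the same mean. The library size and transcript fraction are made-up illustrative values, not figures from the talk.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of x successes in n independent draws with success probability p."""
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)

def poisson_pmf(x, m):
    """Poisson probability of observing count x when the mean is m."""
    return math.exp(-m) * m**x / math.factorial(x)

# Hypothetical library: 50,000 clones, transcript making up 0.01% of the pool.
N_CLONES = 50_000
FRACTION = 1e-4
MEAN = N_CLONES * FRACTION  # expected tag count m

# Largest pointwise gap between the two models over a range of tag counts.
max_gap = max(
    abs(binomial_pmf(x, N_CLONES, FRACTION) - poisson_pmf(x, MEAN))
    for x in range(31)
)
print(f"m = {MEAN:.1f}, largest pointwise gap = {max_gap:.2e}")
```

With a rare transcript and a large library the two models are nearly indistinguishable, which is the regime the abstract has in mind.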

When comparing two libraries, the null hypothesis is that the gene is not differentially expressed, i.e. that the tag count x in one library comes from the same underlying Poisson distribution as the tag count y in the other library. However, each SAGE library represents a single measurement only! From a purely statistical standpoint this is potentially quite problematic: one can be excused for being rather sceptical about how much can actually be learned about the underlying unknown Poisson distribution from a single observation.

The key instrument of the Audic-Claverie approach is a distribution P over tag counts y in one library informed by the tag count x in the other library, under the null hypothesis that the tag counts are generated from the same but unknown Poisson distribution. P is obtained by Bayesian averaging (infinite mixture) of all possible Poisson distributions with mixing proportions equal to the posteriors (given x) under the flat prior over m.
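For equal library sizes this averaging has a closed form. With a flat prior the posterior over m given x is proportional to e^(-m) m^x / x!, and integrating the Poisson probability of y against it gives P(y|x) = (x+y)! / (x! y! 2^(x+y+1)). The following is a minimal sketch of that formula, assuming equal library sizes; the function names are hypothetical, and the upper-tail helper is just one plausible way to turn P into a significance score.

```python
import math

def ac_prob(y, x):
    """A-C statistic P(y | x) for equal library sizes and a flat prior over m."""
    return math.comb(x + y, x) / 2 ** (x + y + 1)

def ac_upper_tail(y, x):
    """Probability of seeing a count of at least y in one library, given x in the other."""
    return 1.0 - sum(ac_prob(k, x) for k in range(y))

if __name__ == "__main__":
    x_obs, y_obs = 5, 20  # made-up tag counts for illustration
    print(f"P(y={y_obs} | x={x_obs}) = {ac_prob(y_obs, x_obs):.3e}")
    print(f"P(Y>={y_obs} | x={x_obs}) = {ac_upper_tail(y_obs, x_obs):.3e}")
```

Note that the formula is symmetric in x and y, so it does not matter which of the two libraries is treated as the conditioning one.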

We ask: given that the tag count samples from SAGE libraries are extremely limited, how useful is the Audic-Claverie methodology in practice? We rigorously analyse the A-C statistic P that forms the backbone of the methodology and represents our knowledge of the underlying tag generating process based on one observation.

We will show that the A-C statistic P and the underlying Poisson distribution of the tag counts share the same mode structure. Moreover, the K-L divergence from the true unknown Poisson distribution to the A-C statistic is minimised when the A-C statistic is conditioned on the mode of the Poisson distribution. Most importantly (and perhaps rather surprisingly), the expectation of this K-L divergence never exceeds 1/2 bit! This constitutes a rigorous quantitative argument, extending the previous empirical Monte Carlo studies, that supports the widespread use of the Audic-Claverie method, even though, by their very nature, the SAGE libraries represent very sparse samples.
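The expected divergence can be spot-checked numerically. The sketch below assumes equal library sizes, uses the closed form P(y|x) = (x+y)! / (x! y! 2^(x+y+1)), and averages the K-L divergence from Poisson(m) to P(.|x) over x drawn from the same Poisson(m). The means tried and the truncation cutoff are arbitrary illustrative choices, and a few numerical values are no substitute for the analysis presented in the talk.

```python
import math

LN2 = math.log(2.0)

def log_poisson(y, m):
    """Natural log of the Poisson probability of count y with mean m."""
    return -m + y * math.log(m) - math.lgamma(y + 1)

def log_ac(y, x):
    """Natural log of the A-C statistic P(y | x), equal library sizes assumed."""
    return (math.lgamma(x + y + 1) - math.lgamma(x + 1)
            - math.lgamma(y + 1) - (x + y + 1) * LN2)

def kl_bits(m, x, cutoff):
    """K-L divergence (in bits) from Poisson(m) to P(. | x), truncated at cutoff."""
    total = 0.0
    for y in range(cutoff):
        lp = log_poisson(y, m)
        total += math.exp(lp) * (lp - log_ac(y, x))
    return total / LN2

def expected_kl_bits(m, cutoff=200):
    """Average of kl_bits over a conditioning count x drawn from Poisson(m)."""
    return sum(math.exp(log_poisson(x, m)) * kl_bits(m, x, cutoff)
               for x in range(cutoff))

if __name__ == "__main__":
    for mean in (1.0, 5.0):  # illustrative means only
        print(f"m = {mean}: expected K-L divergence = {expected_kl_bits(mean):.3f} bits")
```

The cutoff of 200 is far above the means used here, so the truncated tails contribute negligibly; larger means would need a larger cutoff.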

Full paper: http://www.biomedcentral.com/1471-2105/10/310/

This talk is part of the Artificial Intelligence and Natural Computation seminars series.
