Sunday, December 11, 2011

How many hits do you want?

Data mining is a huge business today and also an area of intense investigation. After an intense week of trying to compare gene-expression microarray data from different experiments during my recent stint in Glasgow, I thought I would share some thoughts on analysis methods. Basically the choice comes down to the number of hits you want, which in turn is decided by how you intend to analyse and validate your results.

With modern high-throughput methods one of the main problems is choosing candidates for validation and further study. As long as you are just doing PCRs for validation, you can run quite a number, but proper validation will often require knock-out and/or knock-in experiments, possibly both in vivo and in vitro. Picking the wrong candidate can cost you years of work, so you had better be correct when you make your decision. This is one reason why it is common to filter high-throughput data by the size of the change. Whether that is advisable depends on what you want to do with the data. Modern high-level data analysis often works much better if you have more to work with. Not to mention that many of the new methods of analysis are designed to look for systems of changes, each of which may be negligible on its own, that together may cause significant physiological effects.

The calculation of a p-value comes in two steps. First you calculate a raw value and then you correct that value for multiple comparisons. Some common methods for the first step:
  • ANOVA: Mathematically elegant but computationally heavy and quite strict, producing relatively few significant hits. It also depends on there being similar variances and similar numbers of samples per group.
  • Rank product: Basically ranks each gene in each sample, multiplies these ranks across samples, and compares the results to similar products from randomly generated datasets. Of all the heavy methods this is one of the heavier, i.e. long computation times.
  • Empirical Bayesian methods: I won't say I understand exactly what this does, but it is described as "reducing the standard error toward a common mean." This means that in the end you are comparing the estimates of the means of your samples, i.e. a variant of Student's t-test, but one that somehow takes the rest of your expression data into account. This is a quick method that is also well regarded.
  • T-test: Your favourite statistical test, seldom used by itself in microarrays; see SAM for more details.
  • SAM: Significance Analysis of Microarrays uses t-tests to test the contrast of interest and compares the result to other permutations of the samples. If the contrast of interest shows better significance (by a set level) than the random configurations, it is deemed significant. How computationally heavy the analysis is depends on the number of permutations you use, but it does by necessity take quite a bit of processing.
  • Signal-to-noise ratio: You basically divide the mean (the difference in means) by the standard deviation. This in itself does not produce a p-value, but it does provide a way to rank genes that can be used to pick the most probably changed genes. If you compare the list you get to lists produced by random permutations of your data, then you can produce proper p-values, but that makes it more onerous.
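
To make the last bullet concrete, here is a minimal sketch of a signal-to-noise score for one gene, with a permutation p-value on top. The function names and the expression values are made up for illustration, and using the sum of the two group standard deviations as the "noise" is just one common choice, not necessarily what any given package does.

```python
import random
import statistics

def signal_to_noise(group_a, group_b):
    """Difference in means divided by the summed standard deviations.

    Assumes each group has at least two values that are not all identical,
    otherwise the denominator would be zero.
    """
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    noise = statistics.stdev(group_a) + statistics.stdev(group_b)
    return diff / noise

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Fraction of random relabellings scoring at least as high as the real one."""
    rng = random.Random(seed)
    observed = abs(signal_to_noise(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # shuffle the sample labels
        if abs(signal_to_noise(pooled[:n_a], pooled[n_a:])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical log-expression values for a single gene.
control = [5.1, 4.9, 5.3, 5.0, 5.2]
treated = [6.8, 7.1, 6.9, 7.3, 7.0]

score = signal_to_noise(control, treated)
p_value = permutation_p_value(control, treated)
print(score, p_value)
```

The loop over permutations is the "onerous" part: on a real array it runs once per gene, so the cost scales with both the number of genes and the number of permutations.
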
Once you have your p-values, they have to be corrected for the number of comparisons. The thing is that if you use a cut-off of 5% for the probability of a chance difference, as is common, about 1 in 20 of the genes that are not really changed will still come out as significant. With thousands of genes on an array, that adds up to a lot of false positives.

When correcting your p-values the first important distinction (that many are not aware of) is post-hoc versus à-priori. Mostly these relate to how to test samples after an ANOVA. Post-hoc testing means that you do not know which comparisons you want to make, so you just make them all. 
For example: You want to assess the effect of a treatment on two populations, say men and women. 
This gives rise to four groups: untreated-men, treated-men, untreated-women and treated-women. Or, you may even have before and after measurements in each of these, giving you eight groups. Between four groups there are six possible comparisons (x*(x-1)/2 for x groups), and between eight groups there are 28. If you correct your p-value for all these you may introduce a lot of false-negative results, i.e. expression changes that really are there, but that do not make your corrected cut-off. However, many of these comparisons may be nonsensical.
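
The counting above can be checked with a few lines of Python. This is only a sketch; the group labels are invented to match the example.

```python
from itertools import combinations

def n_pairwise(n_groups):
    """Number of distinct pairs among n_groups: x*(x-1)/2."""
    return n_groups * (n_groups - 1) // 2

# Hypothetical labels for the four-group design.
four_groups = ["untreated-men", "treated-men", "untreated-women", "treated-women"]

# Adding before/after measurements doubles the number of groups.
eight_groups = [f"{group}-{time}" for group in four_groups
                for time in ("before", "after")]

all_pairs = list(combinations(eight_groups, 2))

print(n_pairwise(len(four_groups)))  # 6
print(len(all_pairs))                # 28
```

Printing `all_pairs` is a quick way to see how many of the 28 comparisons you would never actually want to interpret.
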

The difference between untreated-men-before-treatment and treated-women-after-treatment could be one such example. It is difficult to interpret without knowing what the baseline for the women was. What you can do in this kind of situation is decide à-priori (eng. beforehand) to only run some of the comparisons.
For example: You compare men to women before treatment, before to after treatment only within the same group, the before-to-after change only within each sex, and the difference between the changes in treated and untreated men to the same difference in women*.
This is only eight comparisons out of the possible 28, and some of these, namely the comparisons of differences, would not be included in the 28 at all. By cutting the number of comparisons you also cut the correction you have to make to your p-values. In addition, you can argue that the absolute differences are independent of the change caused by treatment and correct these comparisons separately, giving you only a four-fold correction to your p-value instead of a 28-fold one.

Anyway, those were only the corrections between groups for one parameter. In a typical microarray experiment you will have 20000 parameters across your groups, so that the total number of comparisons would be 28*20000 = 560000, or a lot.

Just multiplying your p-values by the number of comparisons was the original and most conservative correction. It is called Bonferroni, because that's the guy who described it. This is generally too conservative and will produce false-negatives even in experiments with a small number of comparisons. However, it is useful for narrowing the number of hits if you think you have too many, or if you need to be really certain of the hits you pick.
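
As a rough illustration of how harsh this gets, here is a small sketch of the multiplication, capped at 1 since a probability cannot exceed that. The raw p-values are invented; the 28 and 20000 are the numbers from the example above.

```python
def bonferroni(p_values, n_comparisons=None):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values) if n_comparisons is None else n_comparisons
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for three genes.
raw = [0.00000001, 0.0001, 0.03]

# 28 group-wise comparisons for each of 20000 genes.
total_comparisons = 28 * 20000

corrected = bonferroni(raw, n_comparisons=total_comparisons)
survivors = [p for p in corrected if p < 0.05]
print(corrected, len(survivors))

# The same raw p-value under a four-fold and a 28-fold correction.
print(bonferroni([0.01], n_comparisons=4), bonferroni([0.01], n_comparisons=28))
```

Only the first gene survives here, even though a raw p-value of 0.0001 would look very convincing on its own. The last line shows why cutting the comparisons down à-priori matters: 0.01 stays under 0.05 with a four-fold correction but not with a 28-fold one.
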

The family-wise error rate correction is what ANOVA uses across all samples, and it is what post-hoc tests like Tukey's honest significant difference use for testing between individual samples. It is quite strict in large data-sets with a relatively low signal-to-noise ratio, like microarrays, but quite relaxed in smaller data-sets. It is hard to use in à-priori-type designs, since these make the estimation of the family-wise variation difficult.

False discovery rate, or FDR, is a stepwise correction across a ranked list. In the commonly used Benjamini-Hochberg version, the p-value at rank i out of n is multiplied by n/i, so the top hit gets the full Bonferroni-sized correction, the second hit half of that, and so on down to the last gene, which is left uncorrected. What it is actually saying is that out of all the genes, from the top of the list to the point you are looking at, a given percentage are expected to be false discoveries. So if you get 3000 genes with an FDR <0.05, you should expect up to 5%, or 150, of them to be false positives. This is one of the least strict corrections, and it is often used in microarray analysis for that very reason.
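
For the curious, here is a minimal sketch of one common FDR procedure, Benjamini-Hochberg. That this is the variant meant here is my assumption, and the p-values are invented. Each p-value is multiplied by the number of tests divided by its rank, and a running minimum from the bottom of the list upwards keeps the adjusted values in order.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # so each adjusted value is never larger than the ones ranked below it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
hits = [p for p in fdr if p < 0.05]
print(fdr, len(hits))
```

With these numbers all four genes pass an FDR cut-off of 0.05, whereas a plain multiplication by four would have dropped the two weakest ones.
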

The final question is whether you should also correct your FDR values by the number of group-wise comparisons you are doing. Or, maybe you are comparing similar, but independent, experiments, which you could argue strengthens your p-values by as much as p1*p2.

As they say: If it was easy, anyone could do it.

* In this kind of design you should really use a multi-way ANOVA instead of a plain one, and that choice often makes it obvious that you should use à-priori contrasts instead of post-hoc tests.
