Thursday, October 29, 2009

Don't log your data twice

I am sitting here with some RNA expression microarray data from an Illumina Beadarray experiment. In general the data consist of a long [23k genes] list of spot intensities for each array in the experiment. If you want to compare any two of these lists in R using Bioconductor and the beadarray package you can run a small piece of code like this:

arrays = 1:2,
labels = targets$sampleID[1:2])

[plotMAXY is a plotting function; exprs extracts expressions from the BeadSummaryData (BSData); arrays tells the function which arrays to compare and labels tells it what to call them]

Then you get a graph like this:

Figure 1: On the lower left you see a XY-plot, where the intensities of all genes on array C1 are plotted against the intensities on array C2. The dashed lines show a two fold change in expression between arrays. To the upper right is the MA-plot (M is the ratio of intensities on the two arrays plotted on the y-axis, and A is the average intensity of a gene plotted on the x-axis).

What should strike you is the fact that one of the arrays appears to have a generally higher intensity than the other and that the difference changes with intensity, giving a banana-shaped curve. Some very smart people have shown that most of this is just an artifact of the method and that the average intensity of all genes in comparable samples should be the same and not vary with the expression. So what you do is normalise the data of all arrays to the same quantile distribution.

BSData.log2.quantile = normaliseIllumina(BSData,
method = "quantile",
transform = "log2")

[BSData.log2.quantile is the output variable; normaliseIllumina is the function that normalises data from Illumina arrays using the method given by method and performs the transformation given by transform; BSData is the BeadSummaryData]

The result should be a graph such as:

Figure 2: plotMAXY result after normalisation, when all is well.

You can see that the average intensities of the two arrays are now equal so that the cloud lies around the unity line in the XY-plot and the 0-line in the MA-plot. There is no longer any banana shape to the curve. But, there are a numer of spots outside the two-fold lines showing differentially expressed genes.

What I got when I plotted the normalised data this time was this plot:

Figure 3: Example of a plotMAXY result that shouldn't exist.

It's like, D..N. It shouldn't look like that. S..T! It's impossible that there should be such a curve. The XY-plot shows one sample to be impossibly much stronger than the other, but only at high intensities. What the F..K are the fold change lines doing at that kind of angle any way? The MA curve at least looks like it should, but show no differently expressed genes at all. B....Y V......E! Even technical replicates aren't that similar.

When I can hear myself above the beating of my heart and can see the screen again through the torrent of cold sweat pouring into my eyes I look more closely at the XY-plot and see that the scales on the x- and y-axis are different. Actually no intensities are higher than 4.

What has happened is that the plotMAXY function in log2-transforms the data by default before plotting (that's why the 2-fold line is called 1), and I already did the log2-transformation when I quantile-normalised the data. What I should have done is added the "log = FALSE" line to the plotMAXY, like this:

arrays = 1:2,
labels = targets$sampleID[1:2],
log = FALSE)

[log = FALSE tells the function plotMAXY not to log2 transform the expression data from BSData.log2.quantile]

Now i get a graph more like figure 2 and can continue with my analysis without further risk of apoplexy. Or maybe not, data analysis is a forest fraught with formidable foes and fearsome formulas.

I leave you now to venture again into the jungle of statistical tricks.

Wish me luck,


No comments:

Post a Comment