I thought I should do a post on strategies for handling genome-scale data from large experiments. Like many happy novices, we rushed in and performed a couple of huge microarray experiments. Now we are trying to compare these against each other and against strain differences in the form of SNPs. To say that this is not trivial is an understatement. In the end we came up with a few strategies that might be useful to others as well. They might seem simple, but they will help you wade through your data without getting stuck there for too long.
Don't look at untested data for anything but quality control. In most analysis programs it is very easy to plot a gene by its intensity, either over all samples or by the crude difference between group averages. This is fun, and can give you some idea of what you are looking for. However, there are some 20,000 to 30,000 genes in your dataset, and you will not have time to look at all of them and make a reasoned decision. If you pick genes this way you will indeed pick some hits, but not in a systematic way, and that leaves you open to errors: you may pick a gene whose variation is too large for the difference to hold up, or one that is only the (n+100)th gene in your set rather than the top one. The point is: you just don't know. So test your genes first, and then use the list of significantly differentially expressed genes to pick your top hits.
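To make the "variation is too large" trap concrete, here is a minimal sketch in Python. It is not the test from our own analysis: Welch's t-statistic simply stands in for whatever test your analysis package applies, and the gene names and intensities are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of intensities (unequal variances allowed)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical log2 intensities: gene -> (treated samples, control samples)
expression = {
    "geneA": ([9.0, 12.5, 7.0, 11.5], [8.0, 8.5, 7.5, 8.0]),  # large difference, very noisy
    "geneB": ([9.1, 9.0, 9.2, 8.9], [8.0, 8.1, 7.9, 8.0]),    # smaller difference, tight groups
}

results = []
for gene, (treated, control) in expression.items():
    diff = mean(treated) - mean(control)
    results.append((gene, diff, welch_t(treated, control)))

# Order by the absolute test statistic, not by the raw difference
results.sort(key=lambda r: abs(r[2]), reverse=True)

for gene, diff, t in results:
    print(f"{gene}: mean difference = {diff:.2f}, t = {t:.2f}")
```

In this toy data geneA has the larger average difference, but geneB comes out on top once the spread within each group is taken into account. With thousands of genes you would also turn the statistic into a p-value and correct for multiple testing, which this sketch leaves out.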
Consider doing a single comparison across several groups. If you have run a large experiment with several groups (where several is anything from four upwards), it is a chore to keep track of which genes were significant in which comparison, and even more of a chore to then pick a small number of genes for validation that are somehow representative across groups. So, try to pool your data by phenotype. If you have one group where something happens and lots of different controls, just pool the controls and do a single comparison. That will give you just one list and make your selection process much easier.
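Pooling by phenotype can be sketched like this. The group names, the phenotype labels and the intensities are hypothetical, and the sketch assumes you have already decided that the different controls behave alike for the phenotype in question.

```python
from statistics import mean

# Hypothetical log2 intensities for one gene, keyed by experimental group
samples_by_group = {
    "treated": [9.1, 9.0, 9.2],
    "control_salt": [8.0, 8.1],
    "control_sham": [7.9, 8.0],
    "control_none": [8.1, 7.9],
}

# Which phenotype each group shows; this mapping is the judgment call you make
phenotype_of = {
    "treated": "affected",
    "control_salt": "unaffected",
    "control_sham": "unaffected",
    "control_none": "unaffected",
}


def pool_by_phenotype(samples, phenotypes):
    """Merge the samples of all groups that share a phenotype label."""
    pooled = {}
    for group, values in samples.items():
        pooled.setdefault(phenotypes[group], []).extend(values)
    return pooled


pooled = pool_by_phenotype(samples_by_group, phenotype_of)

# One comparison instead of three separate treated-vs-control comparisons
difference = mean(pooled["affected"]) - mean(pooled["unaffected"])
print(f"affected vs unaffected: {difference:.2f}")
```

The same pooled groups would then go into whatever test you use, which gives one list of genes rather than one list per control.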
When comparing experiments, use rankings. So, now you have a list, or several if pooling didn't make sense, and you have to find your top candidates for validation. This is where rankings help. The easiest (and often quite accurate) way is to rank by the average difference between groups (that would be your fold-change or log-ratio). If you have just one list, you are done. If you have two or more lists, you can order genes by rank-sum or rank-product across the lists. The rank-sum is a kind method that still gives you a hit even if a gene is poorly ranked (high number) in one of your lists. The rank-product will emphasize those that are highly ranked in all lists. Then you have the more specialized rank-difference and rank-ratio. These may be very interesting if you have two experiments with wildly different phenotypes that you cannot compare directly. If a gene is ranked in the top 10 in one experiment and 7815th in the other, it may be a very interesting gene, and it will be ranked highly by the rank-difference or rank-ratio.
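The four rank combinations are easy to compute once each experiment has been turned into ranks. The sketch below uses made-up log-ratios and ranks genes by absolute log-ratio. That choice, and defining the rank-ratio as the larger rank divided by the smaller, are assumptions made for illustration.

```python
def ranks(scores):
    """Map gene -> rank, where rank 1 is the largest absolute score."""
    ordered = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


# Hypothetical log-ratios from two experiments measured on the same genes
exp1 = {"geneA": 3.2, "geneB": 2.9, "geneC": 0.4, "geneD": -1.5}
exp2 = {"geneA": 2.5, "geneB": 0.1, "geneC": 0.3, "geneD": -2.8}

r1, r2 = ranks(exp1), ranks(exp2)
genes = sorted(r1.keys() & r2.keys())  # only genes present in both lists

combined = {
    g: {
        "rank_sum": r1[g] + r2[g],
        "rank_product": r1[g] * r2[g],
        "rank_difference": abs(r1[g] - r2[g]),
        "rank_ratio": max(r1[g], r2[g]) / min(r1[g], r2[g]),
    }
    for g in genes
}

# Small sum or product: consistently near the top in both experiments
by_sum = sorted(genes, key=lambda g: combined[g]["rank_sum"])
by_product = sorted(genes, key=lambda g: combined[g]["rank_product"])

# Large difference or ratio: near the top in one experiment only
by_difference = sorted(genes, key=lambda g: combined[g]["rank_difference"], reverse=True)
by_ratio = sorted(genes, key=lambda g: combined[g]["rank_ratio"], reverse=True)

print("by rank-sum:       ", by_sum)
print("by rank-product:   ", by_product)
print("by rank-difference:", by_difference)
print("by rank-ratio:     ", by_ratio)
```

Here geneA, which ranks well in both experiments, leads the rank-sum and rank-product orderings, while geneB, which is near the top in one experiment and at the bottom in the other, leads the rank-difference and rank-ratio orderings.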