Gene expression profiling
In the field of molecular biology, gene expression profiling is the measurement of the activity (the expression) of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.
Several transcriptomics technologies can be used to generate the necessary data to analyse. DNA microarrays[1] measure the relative activity of previously identified target genes. Sequence based techniques, like RNA-Seq, provide information on the sequences of genes in addition to their expression level.
Background
Expression profiling is a logical next step after
Expression profiling experiments often involve measuring the relative amount of mRNA expressed in two or more experimental conditions. This is because altered levels of a specific sequence of mRNA suggest a changed need for the protein coded by the mRNA, perhaps indicating a
Comparison to proteomics
The human genome contains on the order of 20,000 genes which work in concert to produce roughly 1,000,000 distinct proteins. This is due to
Use in hypothesis generation and testing
Sometimes, a scientist already has an idea of what is going on, a hypothesis, and he or she performs an expression profiling experiment with the idea of potentially disproving this hypothesis. In other words, the scientist is making a specific prediction about levels of expression that could turn out to be false.
More commonly, expression profiling takes place before enough is known about how genes interact with experimental conditions for a testable hypothesis to exist. With no hypothesis, there is nothing to disprove, but expression profiling can help to identify a candidate hypothesis for future experiments. Most early expression profiling experiments, and many current ones, have this form The figure above represents the output of a two dimensional cluster, in which similar samples (rows, above) and similar gene probes (columns) were organized so that they would lie close together. The simplest form of class discovery would be to list all the genes that changed by more than a certain amount between two experimental conditions.
Class prediction is more difficult than class discovery, but it allows one to answer questions of direct clinical significance such as, given this profile, what is the probability that this patient will respond to this drug? This requires many examples of profiles that responded and did not respond, as well as cross-validation techniques to discriminate between them.
Limitations
In general, expression profiling studies report those genes that showed statistically significant differences under changed experimental conditions. This is typically a small fraction of the genome for several reasons. First, different cells and tissues express a subset of genes as a direct consequence of
The relatively short length of gene lists published from expression profiling experiments limits the extent to which experiments performed in different laboratories appear to agree. Placing expression profiling results in a publicly accessible microarray database makes it possible for researchers to assess expression patterns beyond the scope of published results, perhaps identifying similarity with their own work.
Validation of high throughput measurements
Both
Statistical analysis
Data analysis of microarrays has become an area of intense research.[10] Simply stating that a group of genes were regulated by at least twofold, once a common practice, lacks a solid statistical footing. With five or fewer replicates in each group, typical for microarrays, a single outlier observation can create an apparent difference greater than two-fold. In addition, arbitrarily setting the bar at two-fold is not biologically sound, as it eliminates from consideration many genes with obvious biological significance.
Rather than identify differentially expressed genes using a fold change cutoff, one can use a variety of
Selecting a different test usually identifies a different list of significant genes
As the number of replicate measurements in a microarray experiment increases, various statistical approaches yield increasingly similar results, but lack of concordance between different statistical methods makes array results appear less trustworthy. The MAQC Project[15] makes recommendations to guide researchers in selecting more standard methods (e.g. using p-value and fold-change together for selecting the differentially expressed genes) so that experiments performed in different laboratories will agree better.
Different from the analysis on differentially expressed individual genes, another type of analysis focuses on differential expression or perturbation of pre-defined gene sets and is called gene set analysis. which tests the significance of gene sets based on permutation of gene labels or a parametric distribution.
Gene annotation
While the statistics may identify which gene products change under experimental conditions, making biological sense of expression profiling rests on knowing which protein each gene product makes and what function this protein performs. Gene annotation provides functional and other information, for example the location of each gene within a particular chromosome. Some functional annotations are more reliable than others; some are absent. Gene annotation databases change regularly, and various databases refer to the same protein by different names, reflecting a changing understanding of protein function. Use of standardized gene nomenclature helps address the naming aspect of the problem, but exact matching of transcripts to genes[18][19] remains an important consideration.
Categorizing regulated genes
Having identified some set of regulated genes, the next step in expression profiling involves looking for patterns within the regulated set. Do the proteins made from these genes perform similar functions? Are they chemically similar? Do they reside in similar parts of the cell?
Genes have other attributes beside biological function, chemical properties and cellular location. One can compose sets of genes based on proximity to other genes, association with a disease, and relationships with drugs or toxins. The Molecular Signatures Database[20] and the Comparative Toxicogenomics Database[21] are examples of resources to categorize genes in numerous ways.
Finding patterns among regulated genes
Regulated genes are categorized in terms of what they are and what they do, important relationships between genes may emerge.[23] For example, we might see evidence that a certain gene creates a protein to make an enzyme that activates a protein to turn on a second gene on our list. This second gene may be a transcription factor that regulates yet another gene from our list. Observing these links we may begin to suspect that they represent much more than chance associations in the results, and that they are all on our list because of an underlying biological process. On the other hand, it could be that if one selected genes at random, one might find many that seem to have something in common. In this sense, we need rigorous statistical procedures to test whether the emerging biological themes is significant or not. That is where gene set analysis[16][17] comes in.
Cause and effect relationships
Fairly straightforward statistics provide estimates of whether associations between genes on lists are greater than what one would expect by chance. These statistics are interesting, even if they represent a substantial oversimplification of what is really going on. Here is an example. Suppose there are 10,000 genes in an experiment, only 50 (0.5%) of which play a known role in making cholesterol. The experiment identifies 200 regulated genes. Of those, 40 (20%) turn out to be on a list of cholesterol genes as well. Based on the overall prevalence of the cholesterol genes (0.5%) one expects an average of 1 cholesterol gene for every 200 regulated genes, that is, 0.005 times 200. This expectation is an average, so one expects to see more than one some of the time. The question becomes how often we would see 40 instead of 1 due to pure chance.
According to the hypergeometric distribution, one would expect to try about 10^57 times (10 followed by 56 zeroes) before picking 39 or more of the cholesterol genes from a pool of 10,000 by drawing 200 genes at random. Whether one pays much attention to how infinitesimally small the probability of observing this by chance is, one would conclude that the regulated gene list is enriched[24] in genes with a known cholesterol association.
One might further hypothesize that the experimental treatment regulates cholesterol, because the treatment seems to selectively regulate genes associated with cholesterol. While this may be true, there are a number of reasons why making this a firm conclusion based on enrichment alone represents an unwarranted leap of faith. One previously mentioned issue has to do with the observation that gene regulation may have no direct impact on protein regulation: even if the proteins coded for by these genes do nothing other than make cholesterol, showing that their mRNA is altered does not directly tell us what is happening at the protein level. It is quite possible that the amount of these cholesterol-related proteins remains constant under the experimental conditions. Second, even if protein levels do change, perhaps there is always enough of them around to make cholesterol as fast as it can be possibly made, that is, another protein, not on our list, is the
Bearing the foregoing caveats in mind, while gene profiles do not in themselves prove causal relationships between treatments and biological effects, they do offer unique biological insights that would often be very difficult to arrive at in other ways.
Using patterns to find regulated genes
As described above, one can identify significantly regulated genes first and then find patterns by comparing the list of significant genes to sets of genes known to share certain associations. One can also work the problem in reverse order. Here is a very simple example. Suppose there are 40 genes associated with a known process, for example, a predisposition to diabetes. Looking at two groups of expression profiles, one for mice fed a high carbohydrate diet and one for mice fed a low carbohydrate diet, one observes that all 40 diabetes genes are expressed at a higher level in the high carbohydrate group than the low carbohydrate group. Regardless of whether any of these genes would have made it to a list of significantly altered genes, observing all 40 up, and none down appears unlikely to be the result of pure chance: flipping 40 heads in a row is predicted to occur about one time in a trillion attempts using a fair coin.
For a type of cell, the group of genes whose combined expression pattern is uniquely characteristic to a given condition constitutes the gene signature of this condition. Ideally, the gene signature can be used to select a group of patients at a specific state of a disease with accuracy that facilitates selection of treatments.[25][26] Gene Set Enrichment Analysis (GSEA)[16] and similar methods[17] take advantage of this kind of logic but uses more sophisticated statistics, because component genes in real processes display more complex behavior than simply moving up or down as a group, and the amount the genes move up and down is meaningful, not just the direction. In any case, these statistics measure how different the behavior of some small set of genes is compared to genes not in that small set.
GSEA uses a
Conclusions
Expression profiling provides new information about what genes do under various conditions. Overall, microarray technology produces reliable expression profiles.[28] From this information one can generate new hypotheses about biology or test existing ones. However, the size and complexity of these experiments often results in a wide variety of possible interpretations. In many cases, analyzing expression profiling results takes far more effort than performing the initial experiments.
Most researchers use multiple statistical methods and exploratory data analysis before publishing their expression profiling results, coordinating their efforts with a
See also
- Gene expression profiling in cancer
- Spatiotemporal gene expression
- Transcriptomics
- Splice variant analysis
References
- ^ "Microarrays Factsheet". Retrieved 2007-12-28.
- PMID 15123278.
- PMID 17935276.
- S2CID 40896577.
- PMID 18162499.
- PMID 24143002.
- PMID 17465711.
- ^ van Dongen, Stijn (2000). Graph Clustering by Flow Simulation. University of Utrecht.
- PMID 24564555.
- PMID 17233564.
- ^ "Significance Analysis of Microarrays". Retrieved 2007-12-27.
- PMID 17370338.
- S2CID 1857997.
- PMID 18048398.
- ^ Dr. Leming Shi, National Center for Toxicological Research. "MicroArray Quality Control (MAQC) Project". U.S. Food and Drug Administration. Retrieved 2007-12-26.
- ^ PMID 16199517.
- ^ PMID 19473525.
- PMID 16284200.
- PMID 17448222.
- ^ "GSEA – MSigDB". Retrieved 2008-01-03.
- ^ "CTD: The Comparative Toxicogenomics Database". Retrieved 2008-01-03.
- ^ "Ingenuity Systems". Retrieved 2007-12-27.
- PMID 19439102.
- PMID 15950303.
- PMID 17878518.
- PMID 19377049.
- ^ "GSEA". Retrieved 2008-01-09.
- S2CID 58528299.