We have developed methods for finding associations among the heterogeneous data types in TCGA data. This includes the construction of a feature matrix: a large, heterogeneous matrix which combines virtually all available information regarding patients and samples for a given tumor type. The feature matrix is created by parsing and standardizing both public and protected TCGA data available through the DCC: clinical, mRNA (gene) expression, DNA methylation, microRNA expression, copy number variation, somatic (DNA) mutation data, and RPPA (protein) data.

Our Center has also incorporated other sources of information from the Genome Characterization Centers and other Genome Data Analysis Centers within TCGA. This mixed-type feature matrix includes numerical data (both continuous and discrete) and arbitrary unordered categorical data, while also allowing for missing values, a critical factor when working with biomedical data.

Typical matrices include 20,000 to 50,000 features describing 200 to 1,000 tumor samples. They provide a starting point for all of our downstream analyses, as well as a simple, standardized format for sharing data between collaborators. From the feature matrix, we derive statistically significant pairwise associations, and multivariate associations through Random Forest analysis. Pairwise association analysis has been performed systematically for every tumor analysis working group in which the Center has participated.
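To make the idea of pairwise association over a matrix with missing values concrete, the following is a minimal sketch, not the Center's pipeline. It assumes a Spearman rank correlation for numeric feature pairs and pairwise deletion of samples where either value is missing; the text above does not specify which statistical tests are used. The feature names and values are hypothetical, and categorical features would need a different test (for example, a contingency-table test), which is not shown.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns None if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def spearman_pairwise(x, y, min_samples=3):
    """Spearman correlation over the samples where both features are observed.

    Missing values are encoded as None. Returns (rho, number_of_samples_used);
    rho is None when too few complete pairs remain.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < min_samples:
        return None, len(pairs)
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys)), len(pairs)

# Hypothetical toy feature matrix: one entry per feature, one value per sample.
feature_matrix = {
    "expr_GENE_A": [1.2, 3.4, None, 5.1, 7.8],
    "meth_GENE_A": [0.9, 0.7, 0.6, None, 0.1],
    "cnv_GENE_B": [0, 1, 1, 2, None],
}

# Test every unordered pair of features once.
names = sorted(feature_matrix)
results = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        results[(a, b)] = spearman_pairwise(feature_matrix[a], feature_matrix[b])
```

Each entry of `results` records both the correlation and the number of samples that contributed to it, since with missing data that number differs from pair to pair and affects how much weight a given association deserves.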

Multi-Scale Association Explorer (MSAE)

One of the key applications within Regulome Explorer, the MSAE enables users to search, filter, and visualize analytical results generated from TCGA data. Associations are primarily displayed within the context of genomic coordinates. However, other views, including graphs and tables, may also be used to evaluate associations. Two-dimensional distributions of feature pairs identified by association analysis are also provided for further investigation.

Regulome Explorer. a) The set of feature associations is filtered according to user-specified parameters. b) The circular layout displays the associations as edges in the center, connecting the features (with genomic coordinates) displayed around the perimeter. The outer ring displays cytogenetic bands. The inner ring displays associations that contain features lacking genomic coordinates. c) Sub-chromosome scale associations are explored with the use of a linear browser. d,e) A scale-independent view of the results is presented as a data table and a network. f) The association window, a two-dimensional plot of the feature pair, is rendered in accordance with the specific feature types.

Colorectal Cancer Aggressiveness Explorer

The CRC Aggressiveness Explorer allows the exploration of molecular signatures associated with aggressive CRC, as described in Comprehensive Molecular Characterization of Human Colon and Rectal Tumors (manuscript in press). A molecular signature can be one of several types: a change in the transcription level of a protein-coding gene or a microRNA, a somatic mutation, a somatic copy number alteration, or a change in DNA methylation near a gene promoter. Each signature has a score indicating the statistical significance of the evidence for its association with tumor aggressiveness. The score is a composite of individual association scores for tumor stage, the fraction of positive lymph nodes in the vicinity of the tumor, histological type (mucinous or non-mucinous carcinoma), and the presence or absence of vascular invasion, lymphatic invasion, and distant metastasis. A positive score implies that the signature is more prevalent in aggressive colorectal tumors, while a negative score indicates the opposite. In the CRC Aggressiveness Explorer, positive and negative scores are shown in red and blue respectively, with a color gradient indicating the strength of association.
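The scoring and coloring scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the Explorer's implementation: the text does not say how the individual association scores are combined, so the unweighted sum used here is purely an assumption, as are the variable names and the fade-to-white gradient.

```python
# Clinical variables named in the text; the identifiers themselves are hypothetical.
CLINICAL_VARIABLES = (
    "tumor_stage",
    "positive_lymph_node_fraction",
    "histological_type",
    "vascular_invasion",
    "lymphatic_invasion",
    "distant_metastasis",
)

def composite_score(scores):
    """Combine signed per-variable association scores into one value.

    ASSUMPTION: an unweighted sum, skipping variables with no score.
    The actual combination rule is not given in the source text.
    """
    return sum(
        scores[v] for v in CLINICAL_VARIABLES if scores.get(v) is not None
    )

def score_to_rgb(score, max_abs):
    """Map a score to an (R, G, B) color.

    Positive scores are red and negative scores are blue; the color fades
    toward white as the magnitude approaches zero (assumed gradient).
    """
    if max_abs <= 0 or score == 0:
        return (255, 255, 255)
    strength = min(abs(score) / max_abs, 1.0)
    fade = round(255 * (1 - strength))
    return (255, fade, fade) if score > 0 else (fade, fade, 255)

# Hypothetical signature with scores for only some clinical variables.
signature = {"tumor_stage": 2.0, "vascular_invasion": -0.5, "distant_metastasis": None}
total = composite_score(signature)
color = score_to_rgb(total, max_abs=5.0)
```

Here `max_abs` sets the score magnitude at which the color saturates; in practice it would presumably be chosen from the range of scores being displayed.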

Colorectal Cancer Aggressiveness Explorer

The impact of somatic mutations is assessed by including protein domain-level binarization as features in pairwise statistical tests of heterogeneous TCGA data. Somatic mutation data is converted into protein domain information via a pipeline which incorporates the software tool ANNOVAR (reference: Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data. Nucleic Acids Research, 38:e164, 2010).

Several features are generated for each gene, depending on the type and sequence position of the somatic mutations in each tumor sample in the data set. Synonymous, missense, nonsense, and frameshift mutation types are considered. Protein domains containing a mutation of any of these types are annotated accordingly, with nonsense and frameshift annotations being propagated to all subsequent protein domains. Example associations identified by pairwise statistical tests between these binary somatic mutation annotations and other data types include mutual exclusivity and co-occurrence of genomic events, subtype or other phenotype-associated mutations, and significant changes in gene and miRNA expression, all of which can be viewed within Regulome Explorer.
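The domain-level binarization described above can be sketched as follows. This is a minimal illustration, not the actual ANNOVAR-based pipeline: the input format, the function name `domain_features`, and the example domains are hypothetical. It also assumes that a mutation falling outside every annotated domain produces no domain-level feature, which the text does not address.

```python
MUTATION_TYPES = ("synonymous", "missense", "nonsense", "frameshift")
TRUNCATING = {"nonsense", "frameshift"}  # propagated to downstream domains

def domain_features(domains, mutations):
    """Build binary domain-level mutation features for one gene in one sample.

    domains:   list of (name, start, end), ordered along the protein,
               with 1-based inclusive residue positions (assumed format).
    mutations: list of (position, mutation_type) for this gene and sample.
    Returns a dict mapping (domain_name, mutation_type) to 0 or 1.
    """
    features = {(name, t): 0 for name, _, _ in domains for t in MUTATION_TYPES}
    for pos, mtype in mutations:
        if mtype not in MUTATION_TYPES:
            continue  # other mutation types are not considered
        for idx, (name, start, end) in enumerate(domains):
            if start <= pos <= end:
                features[(name, mtype)] = 1
                if mtype in TRUNCATING:
                    # Nonsense and frameshift annotations carry over to
                    # every subsequent domain of the protein.
                    for later_name, _, _ in domains[idx + 1:]:
                        features[(later_name, mtype)] = 1
    return features

# Hypothetical protein with three domains and two mutations in one sample.
domains = [("kinase_N", 10, 100), ("linker", 120, 150), ("kinase_C", 200, 300)]
mutations = [(50, "missense"), (130, "nonsense")]
features = domain_features(domains, mutations)
```

In this example the missense mutation flags only the domain it falls in, while the nonsense mutation in the second domain also flags the third, reflecting that the downstream part of the protein is affected as well. Each resulting binary feature can then enter the pairwise tests like any other row of the feature matrix.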