Sometimes the TRANSFAC database used by SynoR contains more than one TFBS matrix per TF. For example, there are 5 TRANSFAC matrices for the SP1 TFBS (SP1_01, SP1_Q6, SP1_Q6_01, SP1_Q4_01, SP1_Q2_01). In such cases, the SP1 2 input specification searches for a pair of ANY SP1 TFBS in the module, while the SP1_01 specification selects the SP1_01 matrix specifically. All the matrices were optimized independently, so there is no straightforward rule of thumb for the best module definition when a TFBS has multiple matrices. Experimenting with the different options may be the best approach.
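The difference between a family-level and a matrix-level specification can be illustrated with a minimal Python sketch. It is not SynoR code: the function name matches_spec is hypothetical, and the rule that a specification without an underscore names a whole TF family (the part of the matrix name before the first underscore) is an assumption made only for this illustration.

```python
# The five SP1 matrices listed above.
SP1_MATRICES = ["SP1_01", "SP1_Q6", "SP1_Q6_01", "SP1_Q4_01", "SP1_Q2_01"]


def matches_spec(spec, matrix_name):
    """Return True if a matrix satisfies an input specification.

    Assumption: a specification containing an underscore names one exact
    matrix; a specification without one names a whole TF family.
    """
    if "_" in spec:
        return matrix_name == spec
    return matrix_name.split("_", 1)[0] == spec


# "SP1" accepts every SP1 matrix, "SP1_01" accepts only that matrix.
any_sp1 = [m for m in SP1_MATRICES if matches_spec("SP1", m)]
only_sp1_01 = [m for m in SP1_MATRICES if matches_spec("SP1_01", m)]
```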
Select the Fix the order of TFBS option to require a specific order of TFBS in the cluster. For example, a cluster definition that lists GATA4, HNF4, and SP1 in this order will search for the GATA4 .. HNF4 .. SP1 configuration (or the reverse SP1 .. HNF4 .. GATA4 configuration), but not for HNF4 .. GATA4 .. SP1.
If you select the directionality of a TFBS, it is reversed when the cluster is detected on the reverse strand. Thus, the selection of GATA4 + .. HNF4 - .. SP1 - is equivalent to SP1 + .. HNF4 + .. GATA4 -, but not to SP1 - .. HNF4 - .. GATA4 +, when the Fix the order of TFBS option is selected.
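The strand rule can be checked with a short Python sketch. It is only an illustration of the rule stated above, not SynoR's implementation; the names reverse_strand and equivalent are hypothetical, and a cluster is represented here as an ordered list of (TFBS name, strand) pairs, with None standing for a TFBS whose directionality was not selected.

```python
# Reading a cluster on the opposite strand reverses the order of the
# sites and flips every selected strand; unselected strands stay None.
FLIP = {"+": "-", "-": "+", None: None}


def reverse_strand(cluster):
    """Return the cluster as it reads on the opposite strand."""
    return [(name, FLIP[strand]) for name, strand in reversed(cluster)]


def equivalent(a, b):
    """Two ordered definitions are equivalent if they are identical or
    if one is the reverse-strand reading of the other."""
    return a == b or reverse_strand(a) == b


selection = [("GATA4", "+"), ("HNF4", "-"), ("SP1", "-")]
same = [("SP1", "+"), ("HNF4", "+"), ("GATA4", "-")]
different = [("SP1", "-"), ("HNF4", "-"), ("GATA4", "+")]
```

Here equivalent(selection, same) is true and equivalent(selection, different) is false, which matches the example in the text.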
- full list of the identified modules including:
a) genome position, which is linked to the ECR Browser. Following the ECR Browser link, you can study interspecies conservation of the corresponding genomic locus, extract DNA sequences, list all the neighboring evolutionary conserved regions (ECRs), visualize conserved TFBS in alignments of different genomes, identify cross-species synteny, detect conserved SNPs in ECRs, obtain detailed information on genes, etc.
b) annotation based on overlapping or bracketing gene features (promoter, UTR, etc.)
c) corresponding gene name(s) (annotated using "UCSC known gene annotation") [The corresponding gene name is the name of an overlapping gene (in case of coding, intronic, and UTR elements). It is the name of the nearest gene in case of promoters. The names of two bracketing genes are reported in case of an intergenic element.]
d) multi-species conservation profile
e) FASTA module sequence
f) position and strand of each TFBS in the module (available in the "text" output only)
- summary statistics on different types of modules
- functional annotation of genes corresponding to (bracketing) noncoding modules:
a) enrichment in GO categories
b) tissue specificity of the genes as calculated using the GNF Expression Atlas 2 (Su AI et al., PNAS (2002) 99, 4465-4470)
Follow this link for an example output corresponding to an SRF/SP1 SynoR scan through human/mouse cTFBS.
1. Details on the Gene Ontology analysis:
- Enrichment in GO categories is calculated for genes bracketing noncoding elements using the binomial approximation to the hypergeometric distribution.
- Holm's sequential Bonferroni correction is applied to account for multiple testing.
- GO analysis is performed for all the GO categories that include at least 10 genes from over 18,000 total "UCSC known genes".
- Significantly enriched GO categories (p-value below 0.05) are reported.
- Category name provides a dynamic link to the list of identified genes that fall into that particular category.
- The enrichment column gives a direct ratio of observed vs expected genes.
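The GO analysis steps listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the SynoR implementation: the function names, the input layout (sets of gene names), and the use of an upper-tail binomial test are assumptions, as is applying the 0.05 cutoff to the Holm-adjusted p-value. The background size and the minimum category size are passed in as parameters.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def holm_adjust(pvalues):
    """Holm's sequential Bonferroni adjustment, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted


def go_enrichment(hit_genes, categories, total_genes, min_size=10, alpha=0.05):
    """Return (category, observed, enrichment, adjusted p) for enriched categories.

    hit_genes: set of genes bracketing the identified noncoding modules.
    categories: dict mapping a GO category name to its set of genes.
    """
    n = len(hit_genes)
    if n == 0:
        return []
    rows = []
    for name, members in categories.items():
        if len(members) < min_size:
            continue  # categories with too few genes are not tested
        p = len(members) / total_genes
        observed = len(hit_genes & members)
        expected = n * p
        rows.append((name, observed, observed / expected,
                     binomial_upper_tail(observed, n, p)))
    adjusted = holm_adjust([row[3] for row in rows])
    return [(name, obs, enr, adj)
            for (name, obs, enr, _), adj in zip(rows, adjusted) if adj < alpha]
```

The third element of each returned tuple is the observed-to-expected ratio, as in the enrichment column.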
2. Details on tissue specificity analysis: SynoR collects expression data from the GNF Atlas 2 for the identified genes corresponding to noncoding modules and presents it in a microarray-style table consisting of colored rectangles. The density of red and green colors correlates with the level of relative tissue expression of a particular gene. Brighter colors correspond to higher expression levels. SynoR tissue expression analysis normalizes expression across different tissues for each gene separately. This way, the maximum (positive or negative) gene expression is equivalent for all the genes, and the differences between genes reflect only differences in their expression across tissues.
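The idea of per-gene normalization can be illustrated with a small Python sketch. The exact formula used by SynoR is not given here, so the choice below (centering each gene on its mean across tissues and dividing by the largest absolute deviation) is an assumption made for illustration only, and the function name normalize_gene is hypothetical.

```python
def normalize_gene(values):
    """Rescale one gene's expression across tissues to the range [-1, 1].

    Assumption: values are centered on the gene's mean across tissues and
    divided by the largest absolute deviation, so the strongest positive or
    negative deviation of every gene has magnitude 1.
    """
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    scale = max(abs(c) for c in centered)
    if scale == 0:
        return [0.0] * len(values)  # gene is flat across all tissues
    return [c / scale for c in centered]


# Two genes with very different absolute levels but the same tissue pattern
# end up with the same normalized profile.
low_gene = normalize_gene([10, 20, 60])
high_gene = normalize_gene([1000, 2000, 6000])
```

With a scheme of this kind, a gene expressed at a high absolute level does not dominate the color scale; only its pattern across tissues remains.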
In the first data analysis step, gene expression is clustered by genes and by tissues using the Cluster 3.0 software with the default settings. This allows direct visual identification of clusters of co-expressed genes. Subsequently, the list of tissues with an unexpectedly large number of overexpressed or suppressed genes is extracted, providing tissue specificity estimates for the identified genes. This list is further broken into four categories: (1) significantly overexpressed, (2) some overexpressed, (3) some suppressed, and (4) significantly suppressed. The categories are described above the clustering figure using solid red, light red, light green, and solid green colors, respectively. The same colors are used to highlight these tissues in the clustering figure.