Technical problem – a typical experiment produces tens of
millions of such positions over hundreds of millions to billions of possible
locations (base pairs) in the genome!
Solutions:
Shrink/simplify the data so they are small enough for us to understand (e.g. peak calls, unsupervised machine learning)
Use data visualization to make the original data comprehensible to us
Scientific data visualization – is it important?
Data visualization is a prevalent approach in science
Shaded matrix display from Loua (1873).
Since the advent of sequencing techniques, there have been great advances in visualization methods specific to this field
Helps us to better understand the data and find the patterns that might be lost due to shrinkage/simplification
Great for exploratory data analyses
Very useful for results presentation
We can visualize reads directly, but it is usually more useful to convert them to read coverage
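As an illustration of the read-to-coverage conversion mentioned above, here is a minimal sketch in plain Python. It assumes reads are given as half-open (start, end) intervals on a single chromosome; the function name `read_coverage` and the toy reads are hypothetical, and real pipelines would normally rely on dedicated tools rather than this loop.

```python
def read_coverage(reads, chrom_length):
    """Per-base coverage from half-open (start, end) read intervals."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (chrom_length + 1)
    for start, end in reads:
        start = max(start, 0)            # clip reads to the chromosome
        end = min(end, chrom_length)
        if start >= end:
            continue                     # read lies entirely outside
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the differences gives the depth at every base.
    coverage = []
    depth = 0
    for delta in diff[:chrom_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Toy example (made-up reads on a 10 bp "chromosome").
reads = [(0, 4), (2, 6), (5, 8)]
coverage = read_coverage(reads, 10)
print(coverage)
```

The difference-array approach touches each read only twice, so the cost grows with the number of reads plus the chromosome length rather than with their product.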
*-seq data visualization: multiple parts of the genome, using pre-defined genomic features
Command line tools, e.g. ngsplot
Tools on Galaxy platform: deepTools, Cistrome, etc.
Why do we need yet another visualization tool?
Existing solutions did not meet our requirements:
Custom scripts and programming language libraries allow things to be run in batch,
but are too complicated for users without IT expertise
Even with good training, these tools require a lot of time to code
Galaxy/Cistrome was too slow and not configurable enough
(plus data privacy problem!)
I want to take the best of both worlds: connect the intuitiveness and interactivity of genome browsers with the visualization power of plotting 1000s of genomic features at once.
Goal: fast, intuitive software for exploratory data analyses!
SeqPlots is this software!
We developed a highly configurable, GUI-operated web application for rapidly generating sets of publication-quality linear plots and heatmaps.
See SeqPlots in action in the movie...
Quick explanation of the example at hand
Files - signal profiles from ChIP-seq experiments:
H3K4me3 (marks active promoters)
H3K36me3 (marks transcribed regions of active genes)
Files - genomic features:
C. elegans transcription start sites (TSS), divided into 5 expression bins
based on RNA-seq data
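One possible way to divide genes into expression bins is sketched below. The slides do not say how the 5 bins were defined, so this assumes equal-sized, rank-based bins; the function name `expression_bins`, the gene names and the expression values are all hypothetical.

```python
def expression_bins(expression, n_bins=5):
    """Assign each gene to a rank-based bin: 1 = lowest, n_bins = highest."""
    # Sort genes by expression; the gene name breaks ties deterministically.
    ranked = sorted(expression, key=lambda gene: (expression[gene], gene))
    n = len(ranked)
    # Map each rank (0 .. n-1) to a bin number (1 .. n_bins).
    return {gene: (rank * n_bins) // n + 1 for rank, gene in enumerate(ranked)}


# Made-up expression values for ten hypothetical genes.
expression = {"gene%d" % i: float(i) for i in range(10)}
bins = expression_bins(expression)
print(bins)
```

Each TSS then inherits the bin of its gene, giving five feature sets that can be plotted side by side.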
Tasks:
Compare histone marks between highly and lowly expressed genes.
Check if CpG (CG-dinucleotide) occupancy is higher at transcription start sites (TSS) relative to the local neighborhood
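The CpG task can be pictured as counting CG dinucleotides at each offset around many TSS and averaging across them. The sketch below is only an illustration of that idea, not SeqPlots' implementation: it assumes a single sequence string, 0-based TSS positions, and ignores strand; `cpg_profile` and the toy sequence are hypothetical.

```python
def cpg_profile(sequence, tss_positions, flank):
    """Fraction of TSS windows with a CG dinucleotide starting at each offset.

    Offsets run from -flank to +flank. Windows that would run off either
    end of the sequence are skipped.
    """
    sequence = sequence.upper()
    width = 2 * flank + 1
    counts = [0] * width
    used = 0
    for tss in tss_positions:
        start = tss - flank
        end = tss + flank + 1
        # Need one extra base so the last dinucleotide is complete.
        if start < 0 or end + 1 > len(sequence):
            continue
        used += 1
        for i in range(width):
            if sequence[start + i:start + i + 2] == "CG":
                counts[i] += 1
    if used == 0:
        return [0.0] * width
    return [count / used for count in counts]


# Toy sequence with a CG exactly at each of two made-up TSS positions.
sequence = "AAAAACGAAAAA" + "TTTTTCGTTTTT"
profile = cpg_profile(sequence, [5, 17], flank=3)
print(profile)
```

Comparing the values near offset 0 with those in the flanks answers the question of whether CpG occupancy is elevated at the TSS relative to its local neighborhood.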