BEADS normalization algorithm for ChIP-seq experiments

The function runs BEAD algorithm on aligned files. It requires short read alignment in BAM format, input track (either single experiment BAM or summed input created using sumBAMinputs function), mappability track in BigWig format (see details) and reference genome in FASTA fomat or as BSgenome genomic package.

beads(experiment, control, mappability, genome, uniq = TRUE, insert = 200L,
  mapq_cutoff = 10L, export = "BEADS", rdata = FALSE, export_er = TRUE,
  quickMap = TRUE, ...)

# S4 method for BamFile,BamFile,BigWigFile,ANY
beads(experiment, control,
  mappability, genome, uniq = TRUE, insert = 200L, mapq_cutoff = 10L,
  export = "BEADS", rdata = FALSE, export_er = TRUE, quickMap = TRUE,
  ...)

# S4 method for BamFile,BigWigFile,BigWigFile,ANY
beads(experiment, control,
  mappability, genome, uniq = TRUE, insert = 200L, mapq_cutoff = 10L,
  export = "BEADS", rdata = FALSE, export_er = TRUE, quickMap = TRUE,
  ...)

# S4 method for character,character,character,character
beads(experiment, control,
  mappability, genome, uniq = TRUE, insert = 200L, mapq_cutoff = 10L,
  export = "BEADS", rdata = FALSE, export_er = TRUE, quickMap = TRUE,
  ...)

Arguments

experiment	The path to experiment (sample) alignment file in BAM format or `BamFile` class.
control	The path to control (input) alignment file in BAM format or summed input file in BigWiggle format (accepts `BamFile` and `BigWigFile` classes as well).
mappability	The path to mappability track in BigWiggle format (accepts `BigWigFile` class as well).
genome	The path reference genome FASTA or UCSC identifier for installed `BSgenome` packages e.g. "hg19" for human
uniq	If TRUE the alignment will be uniqued, i.e. only one of non-unique reads will be used.
insert	The expected insert size in base pairs.
mapq_cutoff	The cutoff parameter used to filter BAM alignments for low mapping quality reads.
export	The character vector of BW tracks to be exported.
rdata	If TRUE all data will be exported as R binaries in addition to BigWiggle tracks.
export_er	If TRUE the enriched regions will be exported to BED format.
quickMap	If TRUE the quick mappability processing be used, otherwise the mappability track will be processed by running mean smoothing.
...	parameters passed by reference to rbeads internal functions.

Value

Named BigWigFileList containing exported BW file connections.

Details

Mappability/alignability tracks gives numeric score for level of reference sequence uniqueness. Short reads cannot be confidently aligned to non-unique sequences, so BEADS masks the out. The pre-calculated tracks for many species can be found in genome databases, e.g. following link gives the collection of human tracks for reference genome GRCh37/hg19: http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=340327143&g=wgEncodeMapability.

For other species mappability track can be easily calculated from reference FASTA file using GEM-mappability software (http://www.ncbi.nlm.nih.gov/pubmed/22276185). The GEM library binaries are available on SourceForge: (http://sourceforge.net/projects/gemlibrary/files/gem-library/), while the reference genome FASTA file can be obtained from UCSC: (http://hgdownload.soe.ucsc.edu/downloads.html). The following example illustrates the procedure for C. elegans reference genome (36bp read length and 8 parallel threads set in options): gem-indexer -i ce10.fa -o ce10 -T 8 gem-mappability -I ce10.gem -l 36 -o ce10 -T 8 gem-2-wig -I ce10.gem -i ce10.map -o ce10 wigToBigWig ce10.wig ce10.chrom.sizes ce10.bw

By default only BEADS normalized track is exported. The export files can be control by export parameter, which is the character vector containing following values:

'BEADS': fully normalized files, i.e. GC correction, mappability correction, division by input and scaling by median
'GCandMap': GC correction and mappability correction
'GCcorected': GC correction
'readsCoverage': raw reads coverage, after extending to insert length, filtering /codemapq_cutoff and uniquing if uniq
'control_GCandMap': GC normalized and mappability corrected input, only works if input is a BAM file
'control_GCcorected': GC normalized, only works if input is a BAM file
'control_readsCoverage': input raw reads coverage, after extending to insert length, filtering mapq_cutoff and uniquing if uniq, only works if input is a BAM file

For example: beads(sample_bam, input_bam, map, fa, export=c('control_readsCoverage', 'control_GCcorected', 'control_GCandMap', 'readsCoverage', 'GCcorected', 'GCandMap', 'BEADS'))

Methods (by class)

experiment = BamFile,control = BamFile,mappability = BigWigFile,genome = ANY: Method for signature experiment='BamFile', control='BamFile', mappability='BigWigFile', genome='ANY'
experiment = BamFile,control = BigWigFile,mappability = BigWigFile,genome = ANY: Method for signature experiment='BamFile', control='BigWigFile', mappability='BigWigFile', genome='ANY'
experiment = character,control = character,mappability = character,genome = character: Method for signature experiment='character', control='character', mappability='character', genome='character'

References

http://beads.sourceforge.net/ http://www.ncbi.nlm.nih.gov/pubmed/21646344

Examples

# Get the paths of example files
sample_bam <- system.file("extdata", "GSM1208360_chrI_100Kb_q5_sample.bam", package="rbeads")
input_bam <- system.file("extdata", "Input_fE3_AA169.bam", package="rbeads")
SummedInput_bw <- system.file("extdata", "Ce10_HiSeqFRMInput_UNIQ_bin25bp_chrI_100Kb_sample.bw", package="rbeads")
map_bw <- system.file("extdata", "ce10_mappability_chrI_100Kb_sample.bw", package="rbeads")
ref_fa <- system.file("extdata", "ce10_chrI_100Kb_sample.fa", package="rbeads")

# Set the directory where the output files will be crated
setwd(tempdir())

# Run BEADS for BAM input file
beads(sample_bam, input_bam, map_bw, ref_fa)
#> Importing refference genome...
#> Importing mappability [binning]...
#> Importing input...
#> Reading alignment file [/Users/przemol/code/rbeads/inst/extdata/Input_fE3_AA169.bam]...	<0.04s>
#> Processing alignment file [uniq=TRUE]:...	<0.059s>
#> INFO: 22409 out of 27847 (80.47%) reads mapped uniquely with 10 quality cutoff: [All=27847, Mapped=27847, Qmapped=27807, Quniq=22409].
#> Calculate input GCcontent....	<0.614s>
#> Sample genome for GCcontent....	<0.005s>
#> Calculate histograms for genomic (a) nad and sample (b) GC content...
#> 	<0.016s>
#> Calculate GC weighting vector...
#> 	<0.015s>
#> Assign GC score to every read...	<0s>
#> Prepare GCscore-enriched GRanges...	<0.007s>
#> Calculate GC weighted coverage...	<0.018s>
#> Mappability correction...	<0.01s>
#> 	INFO: Total 3% of genome in masked.
#> Reading alignment file [/Users/przemol/code/rbeads/inst/extdata/GSM1208360_chrI_100Kb_q5_sample.bam]...	<0.025s>
#> Processing alignment file [uniq=TRUE]:...	<0.035s>
#> INFO: 8216 out of 10373 (79.21%) reads mapped uniquely with 10 quality cutoff: [All=10373, Mapped=10373, Qmapped=10346, Quniq=8216].
#> Calculate coverage...	<0.009s>
#> INFO: a = 18
#> Calling enriched regions...	<0.021s>
#> Mask out reads in enriched regions...	<0.009s>
#>  INFO: percentage of reads in non-enriched regions: 57.17%
#> Calculate input GCcontent....	<0.433s>
#> Sample genome for GCcontent....	<0.015s>
#> Calculate histograms for genomic (a) nad and sample (b) GC content...
#> 	<0.029s>
#> Calculate GC weighting vector...
#> 	<0.014s>
#> Assign GC score to every read...	<0s>
#> Prepare GCscore-enriched GRanges...	<0.003s>
#> Calculate GC weighted coverage...	<0.013s>
#> Mappability correction...	<0.01s>
#> 	INFO: Total 3% of genome in masked.
#> DivStep: Divideing...	<0.017s>
#> DivStep [median ]: Scaling...	<0.022s>
#> INFO: BEADS score scaling coefficient = 0.297
#> Exporing ER...
#> Exporing BigWig tracks...
#> Exporting to BigWiggle: GSM1208360_chrI_100Kb_q5_sample_BEADS_linear_1bp.bw

# Run BEADS for SummedInput (BigWig) input file
beads(sample_bam, SummedInput_bw, map_bw, ref_fa)
#> Importing refference genome...
#> Importing mappability [binning]...
#> Importing input...
#> Reading alignment file [/Users/przemol/code/rbeads/inst/extdata/GSM1208360_chrI_100Kb_q5_sample.bam]...	<0.018s>
#> Processing alignment file [uniq=TRUE]:...	<0.032s>
#> INFO: 8216 out of 10373 (79.21%) reads mapped uniquely with 10 quality cutoff: [All=10373, Mapped=10373, Qmapped=10346, Quniq=8216].
#> Calculate coverage...	<0.009s>
#> INFO: a = 18
#> Calling enriched regions...	<0.02s>
#> Mask out reads in enriched regions...	<0.01s>
#>  INFO: percentage of reads in non-enriched regions: 57.17%
#> Calculate input GCcontent....	<0.128s>
#> Sample genome for GCcontent....	<0.014s>
#> Calculate histograms for genomic (a) nad and sample (b) GC content...
#> 	<0.021s>
#> Calculate GC weighting vector...
#> 	<0.016s>
#> Assign GC score to every read...	<0s>
#> Prepare GCscore-enriched GRanges...	<0.004s>
#> Calculate GC weighted coverage...	<0.013s>
#> Mappability correction...	<0.01s>
#> 	INFO: Total 3% of genome in masked.
#> DivStep: Divideing...	<0.011s>
#> DivStep [median ]: Scaling...	<0.014s>
#> INFO: BEADS score scaling coefficient = 0.081
#> Exporing ER...
#> Exporing BigWig tracks...
#> Exporting to BigWiggle: GSM1208360_chrI_100Kb_q5_sample_BEADS_linear_1bp.bw

# NOT RUN {
## Run BEADS for BSgenome package, the reference genome package have to be installed prior to running this example
# source("http://bioconductor.org/biocLite.R")
# biocLite("BSgenome.Celegans.UCSC.ce10")
# library(BSgenome.Celegans.UCSC.ce10)
beads(sample_bam, SummedInput_bw, map_bw, genome='ce10')

## Run BEADS for all BAM files in the directory
#lapply(dir(pattern='bam$'), beads, control=input, mappability=map_bw, genome=ref_fa)
# }
# NOT RUN {
## In parallel:
# library(parallel)
# mclapply(dir(pattern='.bam$'), beads, control=input, mappability=map_bw, genome='ce10', mc.cores=parallel::detectCores())
# }