Tools and Methods
To profile the EBV genome, we developed an efficient pipeline for aligning high-throughput sequencing reads to the viral genome and subtracting out human reads. Small modifications of this pipeline can be used to map ChIP of viral transcription factors to the human genome.
We have received a number of requests to make our pipeline available. We now provide these tools; however, we've included only minimal documentation and we make no guarantee of correctness. Misuse of these tools may cause harm to your computer and/or lead to incorrect results. The tools are provided as are, under the GNU GPL.
Download the EBV (HHV4) consensus genome sequence
Download the EBV annotated genes. Note: this geneset is not perfect and is meant primarily as means of visualization and coarse summary. It is strongly recommended that you also download the original GenBank file if you want to obtain a more complete picture of the EBV coding regions.
Download our scripts for removing human reads and visualization
Download the Bowtie alignment package
Create a bowtie index (
bowtie-build ebv.fa ebv) and fasta index (
samtools faidx ebv.fa.fai)
Obtaining genomic data
High-throughput sequencing experiments are cataloged in databases such as GEO, NCBI SRA, and EGA.
To download data from NCBI SRA, you will need to obtain the
ascpprogram, which is described by NCBI in the NCBI Handbook.
For example, to download data for Vitamin D Receptor ChIP-seq dataset:
nohup ~/.aspera/connect/bin/ascp -QT -l640M -i asperaweb_id_dsa.putty \
email@example.com:/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP005/SRP005910 . &
Aligning data to EBV
Using the data in SRP00590, we can align reads to EBV:
gzip -cd SRR111956.fastq.gz | \ bowtie ebv_index -S -p 22 -v 2 -k 20 -m 20 -t -q - --un /dev/null --max /dev/null | \ samtools view -bt ebv.fa.fai - > SRR111956.ebv.k20.bowtie.align.bam
To subtract out reads that could align to the human genome, download the hg19 bowtie index from the bowtie website and run our remove_mapping_reads.py script specifying the alignment file, index to substract (hg19) and number of mismatches to use (2):
python remove_mapping_reads.py SRR111956.ebv.k20.bowtie.align.bam hg19 2 > $f.sub_hg19.out 2> $f.sub_hg19.align.stats
Visualizing the data
The reads can be strand extended and output as wig files via the bam_overlap.py script. The wig file can then be visualized using IGV or other tools.
TOTAL_READS=`samtools idxstats SRR111956.ebv.k20.bowtie.align.bam.sub_hg19.align.bam | cut -f3 | paste -s -d'+' | bc` python bam_overlap.py SRR111956.ebv.k20.bowtie.align.bam.sub_hg19.align.bam 150 75 1 $TOTAL_READS 0 0 1Parameters:
150 specifies the estimated total fragment length (better to be smaller than larger)
75 specifies how much of the fragment center should be used to estimate overlap (typically half of estimated fragment length)
1 -- ignore this param
$TOTAL_READS is the total number of reads, used to estimate RPKM
0 minimum value to output
0 minimum normalized value to output
1 [T/F] to wrap around the circular viral genome
Additional optional arguments are
[1/0] to ignore fragment size estimate and just use read overlap
[1/0] should seperate files be output for stranded data