Commit 88cd0762 authored by Tommy Tang's avatar Tommy Tang
Browse files

resolve conflicts

parents 739cde93 80dc8cfa
......@@ -3,4 +3,383 @@ a snakemake pipeline to process ChIP-seq files from GEO
I have so many people asking me to process a public GEO ChIP-seq data set for them. I hate to repeat the same steps and decide to make a pipeline for it.
switch to a different brach to see the codes. now I have `shark` branch for the LSF system.
\ No newline at end of file
switch to a different brach to see the codes. now I have `shark` branch for the LSF system.
**UPDATE** 05/30/2017. Now, the pipeline can handle in-house data as well.
Now, this is working on LSF, I will have another branch for Moab.
### work flow of the pipeline
In the `config.yaml` file you can change settings. e.g. path to a different genome to align, p value cut-offs. The `target_reads` is the number of reads that downsampled to. I set 15 million for default. If the number of reads of the orignal bam files are less than `target_reads`, the pipeline will just keep whatever the number it has.
### Dependiencies
* [snakemake]( snakemake is python3
* R > 3.3.0
you will need `optparse` package. `install.packages("optparse")`
Rscript sraDownload.R -a ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -t fastq SRR3144652
This script will download the meta file for each SRR id as well.
* aspera for downloading
check this blog post by MARK ZIEMANN
sh <(curl -s
`sraDownload.R` is under the `scripts` folder from [Luke Zappia](
## single quote your ascp command, otherwise R will be confused
Rscript sraDownload.R -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' -t fastq SRR3144652
* [fastqc](
* `bamCoverage` v2.3.3 from [deeptools]( for making RPKM normalized and input subtracted bigwig files
* [bowtie1]( for aligning short reads (< 50bp)
* [samblaster]( v0.1.22 to remove duplicates and downsampling.
* [samtools]( v1.3.1
* [ROSE]( for calling superEnhancer. ROSE has to be run inside the installation folder. now I hard coded the path in the Snakefile. (you will have to change that to the ROSE directory in your cluster). Todo: expose the path to the `config.yaml` file that one can change.
* [macs1]( v1.4.2 and [macs2]( v2.1.1 for calling peaks (macs2 for broad peaks).
* [multiQC](
* [phantompeakqual](
`macs1`, `macs2` and `ROSE` are python2.x, see this [Using Snakemake and Macs2](!searchin/snakemake/macs%7Csort:relevance/snakemake/60txGSq81zE/NzCUTdJ_AQAJ) in the snakemake google group.
if you look inside the `Snakefile`, I did `source activate root` back to python2.x before running macs1 and macs2.
There will be [Integration of conda package management into Snakemake](
### How to distribute workflows
read [doc](
# start a screen session
# make a folder, name it yourself, I named it workdir for demon
mkdir /rsch2/genomic_med/krai/workdir/
cd /rsch2/genomic_med/krai/workdir/
git clone
cd GEOpyflow-ChIPseq
## go to downsampling branch. shark is LSF system
git checkout shark
## edit the config.yaml file as needed, e.g. set mouse or human for ref genome, p value cut off for peak calling, the number of reads you want to downsample to
nano config.yaml
## skip this if on Shark, samir has py351 set up for you. see below STEPS
conda create -n snakemake python=3 -c bioconda multiqc snakemake deeptools
source activate snakemake
## STEPS for fastq files from GEO
### Download the sra files
Prepare a txt file `SRR.txt` which has two columns: IP and Input:
cat SRR.txt
IP Input
SRR3144652 SRR3144654
SRR3144653 SRR3144654
SRR2518123 SRR2518124
SRR2518125 SRR2518126
SRR2518127 SRR2518128
SRR2518129 SRR2518130
SRR1616137 SRR1616139
SRR1616138 SRR1616139
SRR1616140 SRR1616142
SRR1616141 SRR1616142
### download the sra files using the R script
cd GEOpyflow-ChIPseq
mkdir fastqs
cd fastqs
## fastq-dump only convert the sra files to fastq to the current folderr
make a shell script:
# /bin/bash
set -euo pipefail
## you will need to change the ascp command to get the right path
Rscript ../scripts/sraDownload.R -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' $1
`chmod u+x`
# inside the GEOpyflow-ChIPseq/fastq folder:
cat ../SRR.txt | sed '1d' | tr "\t" "\n" | sort | uniq > srr_unique.txt
## only have 4 jobs in parallel, good behavior on a cluster
cat srr_unique.txt | parallel -j 4 ./ {}
# all the sra files will be downloaded in the current fastqs folder.
Now you have all `sra` files downloaded into `fastqs` folder, proceed below:
### convert `sra` to `fastqs` and compress to .gz files
## you can use a for loop to fastq-dump the downloaded sra files.
find *sra| parallel fastq-dump {}
find *fastq | parallel bgzip {}
## save some space
rm *fastq
# go gack to the GEOpyflow-ChIPseq folder
cd ..
python3 --fastq_dir fastqs/ --meta SRR.txt
A `samples.json` file will be created and some information will be printed out on the screen.
### start the pipeline
## dry run
snakemake -np
## test for one sample
./ 07bigwig/
if no errors, preceed below.
### Using [DRMAA](
[job control through drmaa](
DRMAA is only supported on `Shark` (LSF system).
module load drmma
Using `drmaa` can `control + c` to stop the current run.
Dependent jobs are submitted one by one, if some jobs failed, the pipeline will stop. Good for initital testing.
### submit all jobs to the cluster
All jobs will be submitted to the cluster on queue. This is useful if you know your jobs will succeed for most of them and the jobs are on queue to gain priority.
## process the custom data produced from the sequencing core.
Different People have different naming conventions, to accomondate this situation, I require them to give me a `meta.txt` tab delimited file to have the IP and Input pair name information.
The `` script assumes that the name of the IP and Input in the `meta.txt` file exist in the fastq files.
cd GEOpyflow-ChIPseq
cat meta.txt
IP Input factor reference
Lan-1-5-17-1-A Lan-7-5-17-12-A H3K4me3 mouse+drosophila
Lan-1-5-17-1-B Lan-8-5-12-B H3K4me3 mouse+drosophila
Lan-1-5-17-1-C Lan-7-5-17-12-C H3K4me3 mouse+drosophila
Lan-1-5-17-1-D Lan-7-5-17-12-D H3K4me3 mouse+drosophila
Lan-1-5-17-1-E Lan-7-5-17-12-A KDM5D mouse
Lan-1-5-17-1-F Lan-8-5-12-B KDM5D mouse
Lan-1-5-17-1-G Lan-7-5-17-12-C KDM5D mouse
Lan-2-5-17-3-A Lan-7-5-17-12-A Tbx3 mouse
Lan-2-5-17-3-B Lan-8-5-12-B Tbx3 mouse
Lan-2-5-17-3-C Lan-7-5-17-12-C Tbx3 mouse
Lan-2-5-17-3-D Lan-7-5-17-12-D Tbx3 mouse
Lan-8-5-17-1-H Lan-7-5-17-12-D KDM5D mouse
## only the first 2 columns are required.
## make a samples.json file
python3 --fastq_dir dir/to/fastqs/ --meta meta.txt
The real name of the fastq files:
check the example `samples.json` file in the repo.
Now, the same as the steps as processing the `sra` files
# dry run -np
### Extra notes on file names
If one sets up a lab, it is necessary to have consistent file naming across all the projects. `TCGA`project is a great example for us to follow. A [barcode system]( can make your life a lot easier for downstream analysis.
Similarily, for a ChIP-seq project, it is important to have consistent naming.
In Dr.Kunal Rai'lab. We adopted a barcoding system similar to TCGA:
`TCGA` is the big project name;
`SKCM` is the tumor name;
`M028` is the sample name (this should be an unique identifier);
`11` is the sequencing tag;
we use `11` to denote first time IP, first time sequencing, if the reads number is too few, but the IP worked, we just need to resequence the same library. for the resequencing sample, we will use `12` for this. if the total reads number is still too low, `13` could be used. `21` will be second time IP and first time sequencing. etc.
`P008` is the plate number of that IP experiment, we now use 96-well plate for ChIP-seq, we use this id to track which plate the samples are from.
`A` is the chromatin mark name or transcription factor name. we have a naming map for this:
`A` is H3K4me1, `B` is H3K9me3 and `G` is for Input etc.
The other barcode areas can be used for other information. `NC` means the samples were sequenced in north campus.
It saves me a lot in the downstream processing. The barcode can be captured by a universal regular expression from the fastq.gz files.
A real experiment comes a fastq.gz name like this
multiplexing is very common nowadays, the sequencing reads for the same sample may come from different lanes, after de-multiplexing, multiple files for the same sample will be put in the same folder. If you name your files in a consistent way, you can easily merge all the fastq files before mapping. (for DNA sequencing, it is recommended to map the fastq files independently and then merge the mapped bam files with read-group to identify which lane it is from).
It also helps for merging the fastq files from two different rounds of sequencing. I know sequencing tag `11` and `12` with the same sample name and chromatin mark name are for the same sample, so I can merge them together programatically.
I also know that `G` is a Input control sample, I can then call peaks, make Input subtracted bigwigs etc using a IP vs Input pattern. (A_vs_G, B_vs_G). Same idea can be used for `tumor` and `control` for whole genome sequencing when calling mutations and copynumber.
Many other people out of our lab let me process their data, I can not enforce naming of the files before they carry out the experiments. That's why I require them to give me a `meta.txt` file instead.
### job control
To kill all of your pending jobs you can use the command:
bkill `bjobs -u krai |grep PEND |cut -f1 -d" "`
other useful commands:
bjobs -pl
Display detailed information of all pending jobs of the invoker.
bjobs -ps
Display only pending and suspended jobs.
bjobs -u all -a
Display all jobs of all users.
bjobs -d -q short -m apple -u mtang1
Display all the recently finished jobs submitted by john to the
queue short, and executed on the host apple.
### rerun some of the jobs
# specify the name of the rule, all files that associated with that rule will be rerun. e.g. rerun macs2 calling peaks rule,
./pyflow-ChIPseq -R call_peaks_macs2
## rerun one sample, just specify the name of the target file
./pyflow-ChIPseq -R 02aln/SRR3144652.sorted.bam
# check snakemake -f, -R, --until options
./pyflow-ChIPseq -f call_peaks_macs2
### checking results after run finish
snakemake --summary | sort -k1,1 | less -S
# or detailed summary will give you the commands used to generated the output and what input is used
snakemake --detailed-summary | sort -k1,1 > snakemake_run_summary.txt
### clean the folders
I use echo to see what will be removed first, then you can remove all later.
find . -maxdepth 1 -type d -name "[0-9]*" | xargs echo rm -rf
### Snakemake does not trigger re-runs if I add additional input files. What can I do?
Snakemake has a kind of “lazy” policy about added input files if their modification date is older than that of the output files. One reason is that information what to do cannot be inferred just from the input and output files. You need additional information about the last run to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument –list-input-changes in the following way:
snakemake -n -R `snakemake --list-input-changes`
### How do I trigger re-runs for rules with updated code or parameters?
snakemake -n -R `snakemake --list-params-changes`
snakemake -n -R `snakemake --list-code-changes`
### TO DO list
**provide a output directory** now everything will be output in the current GEOpyflow-ChIPseq directory in a structured fasion. : `00log`, `01seq`, `02fqc`, `03aln` etc
**work for paired-end ChIPseq as well** now only for single-end.
**put everything in docker**
>>>>>>> shark
IP Input
SRR3144652 SRR3144654
SRR3144653 SRR3144654
SRR2518123 SRR2518124
SRR2518125 SRR2518126
SRR2518127 SRR2518128
SRR2518129 SRR2518130
SRR1616137 SRR1616139
SRR1616138 SRR1616139
SRR1616140 SRR1616142
SRR1616141 SRR1616142
shell.prefix("set -eo pipefail; echo BEGIN at $(date); ")
shell.suffix("; exitstat=$?; echo END at $(date); echo exit status was $exitstat; exit $exitstat")
configfile: "config.yaml"
localrules: all
# localrules will let the rule run locally rather than submitting to cluster
# computing nodes, this is for very small jobs
# load cluster config file
CLUSTER = json.load(open(config['CLUSTER_JSON']))
FILES = json.load(open(config['SAMPLES_JSON']))
import csv
import os
SAMPLES = sorted(FILES.keys())
## will be named as sample_IP sample_Input
for sample in SAMPLES:
for mark in FILES[sample].keys():
MARK_SAMPLES.append(sample + "_" + mark)
CONTROLS = [sample for sample in MARK_SAMPLES if '_Input' in sample]
CASES = [sample for sample in MARK_SAMPLES if '_Input' not in sample]
## multiple samples may use the same control input files
## list BAM files
CONTROL_BAM = expand("03aln/{sample}.sorted.bam", sample=CONTROLS_UNIQUE)
CASE_BAM = expand("03aln/{sample}.sorted.bam", sample=CASES)
## create target for peak-calling: will call the rule call_peaks in order to generate bed files
## note: the "zip" function allow the correct pairing of the BAM files
ALL_PEAKS = expand("08peak_macs1/{case}_vs_{control}_macs1_peaks.bed", zip, case=CASES, control=CONTROLS)
ALL_PEAKS.extend(expand("08peak_macs1/{case}_vs_{control}_macs1_nomodel_peaks.bed", zip, case=CASES, control=CONTROLS))
ALL_PEAKS.extend(expand("09peak_macs2/{case}_vs_{control}_macs2_peaks.xls", zip, case=CASES, control=CONTROLS))
ALL_inputSubtract_BIGWIG = expand("06bigwig_inputSubtract/{case}_subtract_{control}.bw", zip, case=CASES, control=CONTROLS)
ALL_SUPER = expand("11superEnhancer/{case}_vs_{control}-super/", zip, case=CASES, control=CONTROLS)
ALL_DOWNSAMPLE_BAM = expand("04aln_downsample/{sample}-downsample.sorted.bam", sample = ALL_SAMPLES)
ALL_FASTQ = expand("01seq/{sample}.fastq", sample = ALL_SAMPLES)
ALL_FASTQC = expand("02fqc/{sample}", sample = ALL_SAMPLES)
ALL_INDEX = expand("03aln/{sample}.sorted.bam.bai", sample = ALL_SAMPLES)
ALL_DOWNSAMPLE_INDEX = expand("04aln_downsample/{sample}-downsample.sorted.bam.bai", sample = ALL_SAMPLES)
ALL_FLAGSTAT = expand("03aln/{sample}.sorted.bam.flagstat", sample = ALL_SAMPLES)
ALL_PHATOM = expand("05phantompeakqual/{sample}_phantom.txt", sample = ALL_SAMPLES)
ALL_BIGWIG = expand("07bigwig/{sample}.bw", sample = ALL_SAMPLES)
ALL_QC = ["10multiQC/multiQC_log.html"]
localrules: all
rule all:
## get a list of fastq.gz files for the same mark, same sample
def get_fastq(wildcards):
sample = "_".join(wildcards.sample.split("_")[0:-1])
mark = wildcards.sample.split("_")[-1]
return FILES[sample][mark]
## now only for single-end ChIPseq,
rule merge_fastqs:
input: get_fastq
output: "01seq/{sample}.fastq"
log: "00log/{sample}_unzip"
threads: CLUSTER["merge_fastqs"]["cpu"]
params: jobname = "{sample}"
message: "merging fastqs gunzip -c {input} > {output}"
shell: "gunzip -c {input} > {output} 2> {log}"
rule fastqc:
input: "01seq/{sample}.fastq"
output: "02fqc/{sample}", "02fqc/{sample}_fastqc.html"
log: "00log/{sample}_fastqc"
threads: CLUSTER["fastqc"]["cpu"]
params : jobname = "{sample}"
message: "fastqc {input}: {threads}"
module load fastqc
fastqc -o 02fqc -f fastq --noextract {input} 2> {log}
# get the duplicates marked sorted bam, remove unmapped reads by samtools view -F 4 and dupliated reads by samblaster -r
# samblaster should run before samtools sort
rule align:
input: "01seq/{sample}.fastq"
output: "03aln/{sample}.sorted.bam", "00log/{sample}.align"
threads: CLUSTER["align"]["cpu"]
bowtie = "--chunkmbs 320 -m 1 --best -p 5 ",
jobname = "{sample}"
message: "aligning {input}: {threads} threads"
bowtie = "00log/{sample}.align",
markdup = "00log/{sample}.markdup"
bowtie {params.bowtie} {config[idx_bt1]} -q {input} -S 2> {log.bowtie} \
| samblaster --removeDups \
| samtools view -Sb -F 4 - \
| samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]} 2> {log.markdup}
rule index_bam:
input: "03aln/{sample}.sorted.bam"
output: "03aln/{sample}.sorted.bam.bai"
log: "00log/{sample}.index_bam"
threads: 1
params: jobname = "{sample}"
message: "index_bam {input}: {threads} threads"
samtools index {input} 2> {log}
# check number of reads mapped by samtools flagstat, the output will be used for downsampling
rule flagstat_bam:
input: "03aln/{sample}.sorted.bam"
output: "03aln/{sample}.sorted.bam.flagstat"
log: "00log/{sample}.flagstat_bam"
threads: 1
params: jobname = "{sample}"
message: "flagstat_bam {input}: {threads} threads"
samtools flagstat {input} > {output} 2> {log}
rule phantom_peak_qual:
input: "03aln/{sample}.sorted.bam", "03aln/{sample}.sorted.bam.bai"
output: "05phantompeakqual/{sample}_phantom.txt"
log: "00log/{sample}_phantompeakqual.log"
threads: 4
params: jobname = "{sample}"
message: "phantompeakqual for {input}"
source activate root
Rscript /scratch/genomic_med/apps/phantompeak/phantompeakqualtools/run_spp_nodups.R -c={input[0]} -savp -rf -p=4 -odir=05phantompeakqual -out={output} -tmpdir=05phantompeakqual 2> {log}
rule down_sample:
input: "03aln/{sample}.sorted.bam", "03aln/{sample}.sorted.bam.bai", "03aln/{sample}.sorted.bam.flagstat"
output: "04aln_downsample/{sample}-downsample.sorted.bam", "04aln_downsample/{sample}-downsample.sorted.bam.bai"
log: "00log/{sample}_downsample.log"
threads: 5
params: jobname = "{sample}"
message: "downsampling for {input}"
import re
import subprocess
with open (input[2], "r") as f:
# fifth line contains the number of mapped reads
line = f.readlines()[4]
match_number = re.match(r'(\d.+) \+.+', line)
total_reads = int(
target_reads = config["target_reads"] # 15million reads by default, set up in the config.yaml file
if total_reads > target_reads:
down_rate = target_reads/total_reads
down_rate = 1
shell("sambamba view -f bam -t 5 --subsampling-seed=3 -s {rate} {inbam} | samtools sort -m 2G -@ 5 -T {outbam}.tmp > {outbam} 2> {log}".format(rate = down_rate, inbam = input[0], outbam = output[0], log = log))
shell("samtools index {outbam}".format(outbam = output[0]))
rule make_inputSubtract_bigwigs:
input : "04aln_downsample/{control}-downsample.sorted.bam", "04aln_downsample/{case}-downsample.sorted.bam", "04aln_downsample/{control}-downsample.sorted.bam.bai", "04aln_downsample/{case}-downsample.sorted.bam.bai"
output: "06bigwig_inputSubtract/{case}_subtract_{control}.bw"
log: "00log/{case}_inputSubtract.makebw"
threads: 5
params: jobname = "{case}"
message: "making input subtracted bigwig for {input}"
source activate root
bamCompare --bamfile1 {input[1]} --bamfile2 {input[0]} --normalizeUsingRPKM --ratio subtract --binSize 30 --smoothLength 300 -p 5 --extendReads 200 -o {output} 2> {log}
rule make_bigwigs:
input : "04aln_downsample/{sample}-downsample.sorted.bam", "04aln_downsample/{sample}-downsample.sorted.bam.bai"
output: "07bigwig/{sample}.bw"
log: "00log/{sample}.makebw"
threads: 5
params: jobname = "{sample}"
message: "making bigwig for {input}"
source activate root
bamCoverage -b {input[0]} --normalizeUsingRPKM --binSize 30 --smoothLength 300 -p 5 --extendReads 200 -o {output} 2> {log}
rule call_peaks_macs1:
input: control = "04aln_downsample/{control}-downsample.sorted.bam", case="04aln_downsample/{case}-downsample.sorted.bam"
output: "08peak_macs1/{case}_vs_{control}_macs1_peaks.bed", "08peak_macs1/{case}_vs_{control}_macs1_nomodel_peaks.bed"
macs1 = "00log/{case}_vs_{control}_call_peaks_macs1.log",
macs1_nomodel = "00log/{case}_vs_{control}_call_peaks_macs1_nomodel.log"
name1 = "{case}_vs_{control}_macs1",
name2 = "{case}_vs_{control}_macs1_nomodel",
jobname = "{case}"
message: "call_peaks macs14 {input}: {threads} threads"
source activate root
macs14 -t {} \
-c {input.control} --keep-dup all -f BAM -g {config[macs_g]} \
--outdir 08peak_macs1 -n {params.name1} -p {config[macs_pvalue]} &> {log.macs1}
# nomodel for macs14, shift-size will be 100 bp (e.g. fragment length of 200bp)
# can get fragment length from the phantompeakqual. Now set it to 200 bp for all.
macs14 -t {} \
-c {input.control} --keep-dup all -f BAM -g {config[macs_g]} \
--outdir 08peak_macs1 -n {params.name2} --nomodel -p {config[macs_pvalue]} &> {log.macs1_nomodel}
rule call_peaks_macs2:
input: control = "04aln_downsample/{control}-downsample.sorted.bam", case="04aln_downsample/{case}-downsample.sorted.bam"
output: bed = "09peak_macs2/{case}_vs_{control}_macs2_peaks.xls"
log: "00log/{case}_vs_{control}_call_peaks_macs2.log"
name = "{case}_vs_{control}_macs2",
jobname = "{case}"
message: "call_peaks macs2 {input}: {threads} threads"
source activate root
## for macs2, when nomodel is set, --extsize is default to 200bp, this is the same as 2 * shift-size in macs14.
macs2 callpeak -t {} \
-c {input.control} --keep-dup all -f BAM -g {config[macs2_g]} \
--outdir 09peak_macs2 -n {} -p {config[macs2_pvalue]} --broad --broad-cutoff {config[macs2_pvalue_broad]} --nomodel &> {log}
rule multiQC:
input :
expand("00log/{sample}.align", sample = ALL_SAMPLES),
expand("03aln/{sample}.sorted.bam.flagstat", sample = ALL_SAMPLES),
expand("02fqc/{sample}", sample = ALL_SAMPLES)