# GEOpyflow-ChIPseq
A Snakemake pipeline to process ChIP-seq files from GEO

Many people ask me to process public GEO ChIP-seq data sets for them. Rather than repeat the same steps over and over, I decided to build a pipeline for it.

Switch to a different branch to see the code; there is currently a `shark` branch for the LSF system.

**UPDATE** 05/30/2017: the pipeline can now handle in-house data as well.

Currently the pipeline works on LSF; a separate branch for Moab is planned.

### Workflow of the pipeline

![](./GEO_rulegraph.png)

In the `config.yaml` file you can change settings, e.g. the path to a different genome to align against and the p-value cut-offs for peak calling. `target_reads` is the number of reads to downsample to; I set 15 million as the default. If the original bam file has fewer reads than `target_reads`, the pipeline simply keeps whatever it has.
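
For example, the relevant entries might look like the sketch below (the key names here are assumptions for illustration; check the `config.yaml` in the repo for the real ones):

```bash
cat config.yaml

## hypothetical keys for illustration only
genome_index: /path/to/bowtie1_index/hg19
macs2_pvalue: 1e-5
target_reads: 15000000
```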

### Dependencies

* [snakemake](https://bitbucket.org/snakemake/snakemake). snakemake runs on python3.
* R > 3.3.0 with the `optparse` package (`install.packages("optparse")`) and the Bioconductor `SRAdb` package:

```r
source("https://bioconductor.org/biocLite.R")
biocLite("SRAdb")
```

Example usage:

```bash
Rscript sraDownload.R -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' -t fastq SRR3144652
```

This script will download the meta file for each SRR id as well.

* aspera (`ascp`) for downloading

Check this blog post by Mark Ziemann: http://genomespot.blogspot.com/2015/05/download-sra-data-with-aspera-command.html

```bash
## download the aspera connect installer first (see the blog post above), then run it
sh aspera-connect-3.6.2.117442-linux-64.sh
```

`sraDownload.R` is under the `scripts` folder, courtesy of [Luke Zappia](https://github.com/lazappi):

```bash
## single quote your ascp command, otherwise R will be confused
Rscript sraDownload.R  -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' -t fastq SRR3144652

```

* [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* `bamCoverage` v2.3.3 from [deeptools](https://github.com/fidelram/deepTools) for making RPKM normalized and input subtracted bigwig files
* [bowtie1](http://bowtie-bio.sourceforge.net/index.shtml) for aligning short reads (< 50bp)
* [samblaster](https://github.com/GregoryFaust/samblaster) v0.1.22 to remove duplicates and for downsampling (see the example pipe after this list).
* [samtools](http://www.htslib.org/) v1.3.1
* [ROSE](http://younglab.wi.mit.edu/super_enhancer_code.html) for calling super-enhancers. ROSE has to be run inside its installation folder; for now I hard-coded the path in the Snakefile (you will have to change it to the ROSE directory on your cluster). Todo: expose the path in the `config.yaml` file so it can be changed.
* [macs1](https://pypi.python.org/pypi/MACS/1.4.2) v1.4.2 and [macs2](https://github.com/taoliu/MACS) v2.1.1 for calling peaks (macs2 for broad peaks).
* [multiQC](http://multiqc.info/)
* [phantompeakqual](https://github.com/kundajelab/phantompeakqualtools)
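
A hedged sketch of how the samblaster dedup step can look in a pipe (not the exact Snakefile command; the index path, fastq name, and thread count are placeholders):

```bash
## align with bowtie1, remove duplicates with samblaster,
## then convert to a sorted bam with samtools
bowtie -p 8 -S /path/to/bowtie1_index/hg19 sample.fastq \
    | samblaster --removeDups \
    | samtools view -Sb - \
    | samtools sort -o sample.sorted.bam -
```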


`macs1`, `macs2` and `ROSE` are python2.x; see this thread, [Using Snakemake and Macs2](https://groups.google.com/forum/#!searchin/snakemake/macs%7Csort:relevance/snakemake/60txGSq81zE/NzCUTdJ_AQAJ), in the snakemake google group.

If you look inside the `Snakefile`, you will see that I `source activate root` to switch back to python2.x before running macs1 and macs2.
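
The pattern looks roughly like this (a sketch only; the environment name comes from the text above, but the macs2 arguments here are illustrative, not the exact Snakefile command):

```bash
## switch back to a python2.x environment before calling peaks
source activate root
macs2 callpeak -t IP.bam -c Input.bam -g hs -n sample_name --broad
```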

Integration of conda package management into Snakemake is in the works: [WIP pull request](https://bitbucket.org/snakemake/snakemake/pull-requests/92/wip-integration-of-conda-package/diff).


### How to distribute workflows

Read the [docs](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html).

```bash
ssh shark.mdanderson.org

# start a screen session
screen

# make a folder; name it yourself. I named it workdir for the demo
mkdir /rsch2/genomic_med/krai/workdir/

cd /rsch2/genomic_med/krai/workdir/

git clone https://github.com/crazyhottommy/GEOpyflow-ChIPseq

cd GEOpyflow-ChIPseq

## switch to the shark branch (shark is an LSF system)
git checkout shark

## edit the config.yaml file as needed, e.g. set mouse or human as the reference genome,
## the p-value cut-off for peak calling, and the number of reads to downsample to
nano config.yaml

## skip this if on Shark; samir has py351 set up for you (see STEPS below)
conda create -n snakemake python=3 -c bioconda multiqc snakemake deeptools
source activate snakemake
```

## STEPS for fastq files from GEO

### Download the sra files

Prepare a tab-delimited file `SRR.txt` with two columns, IP and Input:

e.g.

```bash
cat SRR.txt

IP      Input
SRR3144652      SRR3144654
SRR3144653      SRR3144654
SRR2518123      SRR2518124
SRR2518125      SRR2518126
SRR2518127      SRR2518128
SRR2518129      SRR2518130
SRR1616137      SRR1616139
SRR1616138      SRR1616139
SRR1616140      SRR1616142
SRR1616141      SRR1616142

```

### Download the sra files using the R script

```bash
cd GEOpyflow-ChIPseq
mkdir fastqs
cd fastqs
## fastq-dump only converts the sra files to fastq in the current folder
```

Make a shell script, `download.sh`:

```bash
#!/bin/bash
set -euo pipefail

## you will need to change the ascp command to get the right path
Rscript ../scripts/sraDownload.R -a 'ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh' $1
```

`chmod u+x download.sh`

```bash
# inside the GEOpyflow-ChIPseq/fastqs folder:
cat ../SRR.txt | sed '1d' | tr "\t" "\n" | sort | uniq > srr_unique.txt

## run only 4 jobs in parallel: good behavior on a cluster
cat srr_unique.txt | parallel -j 4 ./download.sh {}

# all the sra files will be downloaded into the current fastqs folder
```

Now you have all the `sra` files downloaded into the `fastqs` folder; proceed below.

### Convert `sra` to `fastq` and compress to .gz files

```bash

## fastq-dump all the downloaded sra files (here via GNU parallel)
find *.sra | parallel fastq-dump {}

find *fastq | parallel bgzip {}

## bgzip compresses in place and removes the original fastq files, saving space

# go back to the GEOpyflow-ChIPseq folder
cd ..

python3 sample2json.py --fastq_dir fastqs/ --meta SRR.txt 
```

A `samples.json` file will be created and some information will be printed to the screen.
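
The structure is roughly like the hypothetical sketch below (the keys and layout here are assumptions; check the example `samples.json` file in the repo for the real format):

```bash
cat samples.json

{
    "SRR3144652": {"IP": ["SRR3144652.fastq.gz"], "Input": ["SRR3144654.fastq.gz"]},
    "SRR3144653": {"IP": ["SRR3144653.fastq.gz"], "Input": ["SRR3144654.fastq.gz"]}
}
```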

### Start the pipeline

```bash
## dry run
snakemake -np

## test for one sample
./pyflow-ChIPseq.sh  07bigwig/SRR2518125.bw

```

If there are no errors, proceed below.

### Using [DRMAA](https://www.drmaa.org/)

[job control through drmaa](http://drmaa-python.readthedocs.io/en/latest/tutorials.html#controlling-a-job)

DRMAA is only supported on `Shark` (LSF system).

```bash
module load drmaa
./pyflow-drmaa-ChIPseq.sh
```

With `drmaa`, you can `control + c` to stop the current run.

Dependent jobs are submitted one by one; if a job fails, the pipeline stops. This is good for initial testing.

### Submit all jobs to the cluster

```bash
./pyflow-ChIPseq.sh 
```

All jobs will be submitted to the cluster queue at once. This is useful when you expect most of the jobs to succeed and want them queued to gain priority.

## Process custom data produced by the sequencing core

Different people have different naming conventions. To accommodate this, I require a tab-delimited `meta.txt` file that gives the IP and Input pair name information.

The `sample2json.py` script assumes that the IP and Input names in the `meta.txt` file appear in the fastq file names.

e.g.

```bash
cd GEOpyflow-ChIPseq
cat meta.txt
IP      Input   factor  reference
Lan-1-5-17-1-A  Lan-7-5-17-12-A H3K4me3 mouse+drosophila
Lan-1-5-17-1-B  Lan-8-5-12-B    H3K4me3 mouse+drosophila
Lan-1-5-17-1-C  Lan-7-5-17-12-C H3K4me3 mouse+drosophila
Lan-1-5-17-1-D  Lan-7-5-17-12-D H3K4me3 mouse+drosophila
Lan-1-5-17-1-E  Lan-7-5-17-12-A KDM5D   mouse
Lan-1-5-17-1-F  Lan-8-5-12-B    KDM5D   mouse
Lan-1-5-17-1-G  Lan-7-5-17-12-C KDM5D   mouse
Lan-2-5-17-3-A  Lan-7-5-17-12-A Tbx3    mouse
Lan-2-5-17-3-B  Lan-8-5-12-B    Tbx3    mouse
Lan-2-5-17-3-C  Lan-7-5-17-12-C Tbx3    mouse
Lan-2-5-17-3-D  Lan-7-5-17-12-D Tbx3    mouse
Lan-8-5-17-1-H  Lan-7-5-17-12-D KDM5D   mouse

## only the first 2 columns are required.

## make a samples.json file
python3 sample2json.py --fastq_dir dir/to/fastqs/ --meta meta.txt 
```

The real names of the fastq files:

`Lan-1-5-17-1-G_S7_L001_R1_001.fastq.gz`    
`Lan-7-5-17-12-C_S30_L005_R1_001.fastq.gz`

Check the example `samples.json` file in the repo.

Now, the steps are the same as for processing the `sra` files:

```bash
# dry run
./pyflow-ChIPseq.sh -np
```


### Extra notes on file names
If one sets up a lab, it is necessary to have consistent file naming across all projects. The `TCGA` project is a great example to follow: a [barcode system](https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode) can make your life a lot easier for downstream analysis.

![](./TCGA.png)

Similarly, for a ChIP-seq project, it is important to have consistent naming.
In Dr. Kunal Rai's lab, we adopted a barcoding system similar to TCGA's:

e.g.
`TCGA-SKCM-M028-11-P008-A-NC-CJT-T`

`TCGA` is the big project name;  
`SKCM` is the tumor name;  
`M028` is the sample name (this should be a unique identifier);  
`11` is the sequencing tag;  
We use `11` to denote first-time IP, first-time sequencing. If the read number is too low but the IP worked, we just need to resequence the same library; the resequenced sample gets `12`, and if the total read number is still too low, `13` can be used. `21` is second-time IP, first-time sequencing, and so on.
`P008` is the plate number of that IP experiment; we now use 96-well plates for ChIP-seq and use this id to track which plate the samples came from.  
`A` is the chromatin mark or transcription factor name. We keep a naming map for this: `A` is H3K4me1, `B` is H3K9me3, `G` is Input, etc.

The other barcode fields can be used for other information; `NC` means the samples were sequenced on the north campus.

This saves me a lot of effort in downstream processing: the barcode can be captured from the fastq.gz file names by a universal regular expression.

A real experiment produces a fastq.gz name like this:

`TCGA-SKCM-M028-R1-P008-C-NC-CJT-T_TTAGGC_L001_R1_006.fastq.gz`  
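
As a hedged illustration, the fields can be pulled out with a single regular expression (the field patterns below are assumptions based on the example names):

```bash
## print the parsed barcode fields for every fastq.gz in the folder
ls *.fastq.gz \
    | sed -E 's/^(TCGA)-([A-Z]+)-([A-Z0-9]+)-([0-9A-Z]+)-(P[0-9]+)-([A-Z])-.*/project=\1 tumor=\2 sample=\3 seqtag=\4 plate=\5 mark=\6/'
```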

Multiplexing is very common nowadays: sequencing reads for the same sample may come from different lanes, and after de-multiplexing, multiple files for the same sample end up in the same folder. If you name your files consistently, you can easily merge all the fastq files before mapping. (For DNA sequencing, it is recommended to map the fastq files independently and then merge the mapped bam files, using read groups to identify which lane each came from.)
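
A minimal sketch of that merge, assuming the `_L001_R1_001.fastq.gz` style suffix shown above (gzip files can be concatenated directly):

```bash
## merge per-sample fastq.gz files coming from multiple lanes/chunks
for sample in $(ls *.fastq.gz | sed -E 's/_L[0-9]+_R1_[0-9]+\.fastq\.gz$//' | sort -u); do
    cat "${sample}"_L*_R1_*.fastq.gz > "${sample}"_merged_R1.fastq.gz
done
```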

It also helps with merging the fastq files from two different rounds of sequencing: I know that sequencing tags `11` and `12` with the same sample name and chromatin mark belong to the same sample, so I can merge them programmatically.

I also know that `G` is an Input control sample, so I can call peaks, make Input-subtracted bigwigs, etc. using an IP vs Input pattern (A_vs_G, B_vs_G). The same idea applies to `tumor` vs `control` in whole-genome sequencing when calling mutations and copy number.

Many people outside our lab ask me to process their data, and I cannot enforce naming of the files before they carry out the experiments. That is why I require them to give me a `meta.txt` file instead.

### Job control

To kill all of your pending jobs you can use the command:

```bash
bkill `bjobs -u krai |grep PEND |cut -f1 -d" "`
```

Other useful commands:

```
bjobs -pl
Display detailed information of all pending jobs of the invoker.

bjobs -ps
Display only pending and suspended jobs.

bjobs -u all -a
Display all jobs of all users.

bjobs -d -q short -m apple -u mtang1
Display all the recently finished jobs submitted by mtang1 to the
queue short, and executed on the host apple.
```

### Rerun some of the jobs

```bash

# specify the name of a rule and all files associated with that rule will be rerun,
# e.g. rerun the macs2 peak calling rule:
./pyflow-ChIPseq.sh -R call_peaks_macs2

## rerun one sample: just specify the name of the target file

./pyflow-ChIPseq.sh -R 02aln/SRR3144652.sorted.bam

# check snakemake -f, -R, --until options
./pyflow-ChIPseq.sh -f call_peaks_macs2
```

### Checking results after the run finishes

```bash

snakemake --summary | sort -k1,1 | less -S

# the detailed summary also gives you the commands used to generate the output and which input was used
snakemake --detailed-summary | sort -k1,1 > snakemake_run_summary.txt
```


### clean the folders

I use echo first to see what will be removed; drop the echo to actually remove.

```bash
find . -maxdepth 1 -type d -name "[0-9]*" | xargs echo rm -rf
```


### Snakemake does not trigger re-runs if I add additional input files. What can I do?

Snakemake has a kind of “lazy” policy about added input files if their modification date is older than that of the output files. One reason is that what to do cannot be inferred just from the input and output files; additional information about the last run would need to be stored. Since behaviour would be inconsistent between cases where that information is available and where it is not, this functionality has been encoded as an extra switch. To trigger updates for jobs with changed input files, you can use the command line argument `--list-input-changes` in the following way:

```bash
snakemake -n -R `snakemake --list-input-changes`

```

### How do I trigger re-runs for rules with updated code or parameters?

```bash
snakemake -n -R `snakemake --list-params-changes`
```

and

```bash
snakemake -n -R `snakemake --list-code-changes`
```


### TO DO list

**Provide an output directory.** Currently everything is output in the GEOpyflow-ChIPseq directory in a structured fashion: `00log`, `01seq`, `02fqc`, `03aln`, etc.  
**Support paired-end ChIP-seq as well.** Currently only single-end is handled.  
**Put everything in docker.**
