**snakemake** lets you define the logic (steps) of a workflow once and apply it to any number of samples.
%% Cell type:markdown id: tags:
### The Snakefile
All rules are defined in a file written in a Python-based language.
By default, **snakemake** searches for a file called `Snakefile` in the directory where it is executed, but any file can be used as long as the **snakemake** rule definitions are well written.
In order to execute the jobs written in a `Snakefile`, all that is needed is:
```
snakemake -j X
```
where `X` is the number of jobs to run in parallel.
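As a minimal sketch (the file and rule names here are illustrative, not part of our data), a `Snakefile` could look like this:

```
rule all:
    input: "results/summary.txt"

rule summarize:
    input: "data/raw.txt"
    output: "results/summary.txt"
    shell: "wc -l {input} > {output}"
```

Running `snakemake -j 1` in the directory containing this file would create `results/summary.txt` from `data/raw.txt`.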
%% Cell type:markdown id: tags:
### Execution order of rules
**snakemake** is a "top to bottom" workflow manager:
- the first rule lists the target files to create
- the execution order is then determined from the input and output files of each rule

**snakemake** parallelizes every step whenever it can.
<center>
<img src="image_folder/snakeParallelization.png"/>
</center>
%% Cell type:markdown id: tags:
### Visualizing the parallelization
In our data, we have 3 different individuals, so some steps can be parallelized. The dependency graph (DAG) of jobs can be visualized with:
```
snakemake --dag | dot -Tpng > snakeDAG.png
```
%% Cell type:markdown id: tags:
<center>
<img src="image_folder/snakeDAG.png"/>
</center>
%% Cell type:markdown id: tags:
### Using the wildcards
**snakemake** uses file paths to determine the execution order. Each rule uses its input files to produce its output files. A **wildcard** can be used as a placeholder to apply the logic of a rule to any number of files.
```
rule index_bam:
    input: "{sample}.bam"
    output: "{sample}.bai"
    shell: "samtools index {input} {output}"
```
Here, `{sample}` is a wildcard that can take any value. **Wildcards** are resolved from the output files and then substituted into the input files.
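For example (the sample names here are hypothetical), a target rule can use `expand()` to request the indexed output for every sample; **snakemake** then resolves `{sample}` once per sample and runs the rule as many times as needed, in parallel when possible:

```
SAMPLES = ["indivA", "indivB", "indivC"]

rule all:
    input: expand("{sample}.bai", sample=SAMPLES)

rule index_bam:
    input: "{sample}.bam"
    output: "{sample}.bai"
    shell: "samtools index {input} {output}"
```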
%% Cell type:markdown id: tags:
### Error recovery
One of the aims of a workflow manager such as **snakemake** is to create files only as needed, which means avoiding rerunning rules that have already completed. Thus, if part of the pipeline has already run and an error occurs, the workflow manager must detect it, stop the execution of the pipeline, and resume from where it stopped in a later run.
%% Cell type:markdown id: tags:
<center>
<img src="image_folder/snakeDAGIncomplete.png"/>
</center>
%% Cell type:markdown id: tags:
### File timestamps
**snakemake** uses file **timestamps** (creation or modification time) to decide whether a rule needs to run. The files created by **snakemake** therefore need timestamps consistent with the execution order.
```
reason: Input files updated by another job: data/calling/results.vcf.gz

[Mon Dec 7 10:10:36 2020]
localrule all:
    input: data/calling/results.annot.vcf.gz
    jobid: 0
    reason: Input files updated by another job: data/calling/results.annot.vcf.gz

Job counts:
	count	jobs
	1	all
	1	annotVCF
	1	callVariations
	3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
```
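The underlying idea can be sketched in plain Python (a simplified illustration, not **snakemake**'s actual implementation): an output is considered out of date when it is missing or older than one of its inputs.

```python
import os

def needs_run(inputs, outputs):
    """Return True if any output is missing or older than any input file."""
    if not all(os.path.exists(out) for out in outputs):
        return True  # a target is missing: the rule must run
    oldest_output = min(os.path.getmtime(out) for out in outputs)
    newest_input = max(os.path.getmtime(inp) for inp in inputs)
    return newest_input > oldest_output  # an input was updated after the outputs
```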
%% Cell type:markdown id: tags:
### Using conda environments
**snakemake** can use a dedicated conda environment for each rule, defined by a conda recipe.
```
rule index_bam:
    input: "{sample}.bam"
    output: "{sample}.bai"
    conda: "envs/samtools.yaml"
    shell: "samtools index {input} {output}"
```
%% Cell type:markdown id: tags:
```
$ cat envs/samtools.yaml
name: samtools
channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools=1.19
```
%% Cell type:markdown id: tags:
Instruct **snakemake** to use the conda environments on the command line:
```
snakemake --use-conda -j 2
```
**snakemake** will automatically install all conda environments. As long as the recipe stays unchanged, **snakemake** will not reinstall the environment.
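The reuse behaviour can be pictured as content-based caching (a simplified sketch in plain Python, not **snakemake**'s actual code; names and paths are hypothetical): each environment is keyed by a hash of its recipe, so an unchanged recipe maps to an already-installed environment.

```python
import hashlib

# Hypothetical cache mapping recipe hash -> installed environment path
installed_envs = {}

def env_key(recipe_text):
    """Derive a stable key from the recipe content."""
    return hashlib.sha256(recipe_text.encode()).hexdigest()[:12]

def ensure_env(recipe_text):
    """Install the environment only if this exact recipe was never seen."""
    key = env_key(recipe_text)
    if key not in installed_envs:
        installed_envs[key] = ".snakemake/conda/" + key  # install step would happen here
    return installed_envs[key]
```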
%% Cell type:markdown id: tags:
### Interesting snakemake options
All **snakemake** options are available by requesting the help on the command line:
```
snakemake -h
```
%% Cell type:markdown id: tags:
```
--dry-run, --dryrun, -n
Do not execute anything, and display what would be
done. If you have a very large workflow, use --dry-run
--quiet to just print a summary of the DAG of jobs.
(default: False)
```
%% Cell type:markdown id: tags:
```
--snakefile FILE, -s FILE
The workflow definition in form of a
snakefile. Usually, you should not need to specify
this. By default, Snakemake will search for
'Snakefile', 'snakefile', 'workflow/Snakefile',
'workflow/snakefile' beneath the current working
directory, in this order. Only if you definitely want
a different layout, you need to use this parameter.
(default: None)
```
%% Cell type:markdown id: tags:
```
--cores [N], --jobs [N], -j [N]
                        Use at most N CPU cores/jobs in parallel. If N is
                        omitted or 'all', the limit is set to the number of
                        available CPU cores. (default: None)

--configfile FILE [FILE ...]
                        Specify or overwrite the config file of the workflow
                        (see the docs). Values specified in JSON or YAML
                        format are available in the global config dictionary
                        inside the workflow. Multiple files overwrite each
                        other in the given order. (default: None)
```
%% Cell type:markdown id: tags:
```
--touch, -t Touch output files (mark them up to date without
really changing them) instead of running their
commands. This is used to pretend that the rules were
executed, in order to fool future invocations of
snakemake. Fails if a file does not yet exist. Note
that this will only touch files that would otherwise
be recreated by Snakemake (e.g. because their input
files are newer). For enforcing a touch, combine this
with --force, --forceall, or --forcerun. Note however
that you lose the provenance information when the
files have been created in reality. Hence, this
should be used only as a last resort. (default: False)
```
%% Cell type:markdown id: tags:
```
--reason, -r Print the reason for each executed rule. (default:
False)
```
%% Cell type:markdown id: tags:
```
--printshellcmds, -p Print out the shell commands that will be executed.
(default: False)
```
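These options combine well; for example, a dry run that also prints the reason for each job and the shell commands it would execute:

```
snakemake -n -r -p
```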
%% Cell type:markdown id: tags:
### Scaling up - using a cluster
Another aim of a workflow manager such as **snakemake** is to scale the jobs to whatever computing resources are available, including a cluster.
In order to run **snakemake** on our [BiRD cluster](https://pf-bird.univ-nantes.fr/resources/birdcluster/register-for-a-new-account), the option `--cluster` needs to be used.
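For example, with a generic `qsub` submission command (the exact submission command and its options depend on the cluster's scheduler):

```
snakemake --cluster "qsub" -j 32
```

Each job is then submitted to the scheduler instead of running locally, with at most 32 jobs queued or running at once.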