**snakemake** lets you define the logic (steps) of a workflow once and apply it to any number of samples.
%% Cell type:markdown id: tags:
### The Snakefile
All rules are defined in a file written in a Python-based language.
By default, **snakemake** searches for a file called `Snakefile` in the directory where it is executed, but any file can be used as long as the **snakemake** rule definitions are well written.
In order to execute the jobs written in a `Snakefile`, all that is needed is:
```
snakemake -j X
```
where `X` is the number of jobs to run in parallel.
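As a minimal sketch (the file and rule names here are illustrative, not part of our data), a `Snakefile` could look like this:

```
rule all:
    input: "results/summary.txt"

rule summarize:
    input: "data/raw.txt"
    output: "results/summary.txt"
    shell: "wc -l {input} > {output}"
```

Running `snakemake -j 1` in the directory containing this file would create `results/summary.txt` from `data/raw.txt`.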
%% Cell type:markdown id: tags:
### Execution order of rules
**snakemake** is a "top to bottom" workflow manager:
- the first rule lists the target files to create
- the execution order is then determined from the input and output files of each rule

**snakemake** parallelizes every step whenever it can.
<center>
<img src="image_folder/snakeParallelization.png"/>
</center>
%% Cell type:markdown id: tags:
### Visualizing the parallelization
In our data, we have 3 different individuals, so some steps can be parallelized. The dependency graph (DAG) of jobs can be visualized with:
```
snakemake --dag | dot -Tpng > snakeDAG.png
```
%% Cell type:markdown id: tags:
<center>
<img src="image_folder/snakeDAG.png"/>
</center>
%% Cell type:markdown id: tags:
### Using the wildcards
**snakemake** uses file paths to determine the execution order. Each rule uses its input files to produce its output files. A **wildcard** can be used as a placeholder to apply the logic of a rule to any number of files.
```
rule index_bam:
    input: "{sample}.bam"
    output: "{sample}.bai"
    shell: "samtools index {input} {output}"
```
Here, `{sample}` is a wildcard that can take any value. **Wildcards** are resolved from the output files and then substituted into the input files.
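For example (the sample names here are hypothetical), a target rule can use `expand()` to request the indexed output for every sample; **snakemake** then resolves `{sample}` once per sample and runs the rule as many times as needed, in parallel when possible:

```
SAMPLES = ["indivA", "indivB", "indivC"]

rule all:
    input: expand("{sample}.bai", sample=SAMPLES)

rule index_bam:
    input: "{sample}.bam"
    output: "{sample}.bai"
    shell: "samtools index {input} {output}"
```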
%% Cell type:markdown id: tags:
### Error recovery
One of the aims of a workflow manager such as **snakemake** is to create files only as needed, which means avoiding rerunning rules that have already completed. Thus, if part of the pipeline has already run and an error occurs, the workflow manager must detect it, stop the execution of the pipeline, and resume from where it stopped in a later run.
%% Cell type:markdown id: tags:
<center>
<img src="image_folder/snakeDAGIncomplete.png"/>
</center>
%% Cell type:markdown id: tags:
### File timestamps
**snakemake** uses file **timestamps** (creation or modification time) to decide whether a rule needs to run. The files created by **snakemake** therefore need timestamps consistent with the execution order.
```
reason: Input files updated by another job: data/calling/results.vcf.gz

[Mon Dec 7 10:10:36 2020]
localrule all:
    input: data/calling/results.annot.vcf.gz
    jobid: 0
    reason: Input files updated by another job: data/calling/results.annot.vcf.gz

Job counts:
	count	jobs
	1	all
	1	annotVCF
	1	callVariations
	3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
```
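The underlying idea can be sketched in plain Python (a simplified illustration, not **snakemake**'s actual implementation): an output is considered out of date when it is missing or older than one of its inputs.

```python
import os

def needs_run(inputs, outputs):
    """Return True if any output is missing or older than any input file."""
    if not all(os.path.exists(out) for out in outputs):
        return True  # a target is missing: the rule must run
    oldest_output = min(os.path.getmtime(out) for out in outputs)
    newest_input = max(os.path.getmtime(inp) for inp in inputs)
    return newest_input > oldest_output  # an input was updated after the outputs
```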
%% Cell type:markdown id: tags:
### Using conda environments
**snakemake** can use a dedicated conda environment for each rule, defined by a conda recipe.
```
rule index_bam:
    input: "{sample}.bam"
    output: "{sample}.bai"
    conda: "envs/samtools.yaml"
    shell: "samtools index {input} {output}"
```
%% Cell type:markdown id: tags:
```
$ cat envs/samtools.yaml
name: samtools
channels:
  - bioconda
  - conda-forge
dependencies:
  - samtools=1.19
```
%% Cell type:markdown id: tags:
Instruct **snakemake** to use the conda environments on the command line:
```
snakemake --use-conda -j 2
```
**snakemake** will automatically install all conda environments. As long as the recipe stays unchanged, **snakemake** will not reinstall the environment.
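The reuse behaviour can be pictured as content-based caching (a simplified sketch in plain Python, not **snakemake**'s actual code; names and paths are hypothetical): each environment is keyed by a hash of its recipe, so an unchanged recipe maps to an already-installed environment.

```python
import hashlib

# Hypothetical cache mapping recipe hash -> installed environment path
installed_envs = {}

def env_key(recipe_text):
    """Derive a stable key from the recipe content."""
    return hashlib.sha256(recipe_text.encode()).hexdigest()[:12]

def ensure_env(recipe_text):
    """Install the environment only if this exact recipe was never seen."""
    key = env_key(recipe_text)
    if key not in installed_envs:
        installed_envs[key] = ".snakemake/conda/" + key  # install step would happen here
    return installed_envs[key]
```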
%% Cell type:markdown id: tags:
### Interesting snakemake options
All **snakemake** options are available by requesting the help on the command line:
```
snakemake -h
```
%% Cell type:markdown id: tags:
```
--dry-run, --dryrun, -n
Do not execute anything, and display what would be
done. If you have a very large workflow, use --dry-run
--quiet to just print a summary of the DAG of jobs.
(default: False)
```
%% Cell type:markdown id: tags:
```
--snakefile FILE, -s FILE
The workflow definition in form of a
snakefile. Usually, you should not need to specify
this. By default, Snakemake will search for
'Snakefile', 'snakefile', 'workflow/Snakefile',
'workflow/snakefile' beneath the current working
directory, in this order. Only if you definitely want
a different layout, you need to use this parameter.
(default: None)
```
%% Cell type:markdown id: tags:
```
--cores [N], --jobs [N], -j [N]
                        Use at most N CPU cores/jobs in parallel. If N is
                        omitted or 'all', the limit is set to the number of
                        available CPU cores. (default: None)

--configfile FILE [FILE ...]
                        Specify or overwrite the config file of the workflow
                        (see the docs). Values specified in JSON or YAML
                        format are available in the global config dictionary
                        inside the workflow. Multiple files overwrite each
                        other in the given order. (default: None)
```
%% Cell type:markdown id: tags:
```
--touch, -t Touch output files (mark them up to date without
really changing them) instead of running their
commands. This is used to pretend that the rules were
executed, in order to fool future invocations of
snakemake. Fails if a file does not yet exist. Note
that this will only touch files that would otherwise
be recreated by Snakemake (e.g. because their input
files are newer). For enforcing a touch, combine this
with --force, --forceall, or --forcerun. Note however
that you lose the provenance information when the
files have been created in reality. Hence, this
should be used only as a last resort. (default: False)
```
%% Cell type:markdown id: tags:
```
--reason, -r Print the reason for each executed rule. (default:
False)
```
%% Cell type:markdown id: tags:
```
--printshellcmds, -p Print out the shell commands that will be executed.
(default: False)
```
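These options combine well; for example, a dry run that also prints the reason for each job and the shell commands it would execute:

```
snakemake -n -r -p
```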
%% Cell type:markdown id: tags:
### Scaling up - using a cluster
Another aim of a workflow manager such as **snakemake** is to scale the jobs to whatever computing resources are available, including a cluster.
In order to run **snakemake** on our [BiRD cluster](https://pf-bird.univ-nantes.fr/resources/birdcluster/register-for-a-new-account), the option `--cluster` needs to be used.
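For example, with a generic `qsub` submission command (the exact submission command and its options depend on the cluster's scheduler):

```
snakemake --cluster "qsub" -j 32
```

Each job is then submitted to the scheduler instead of running locally, with at most 32 jobs queued or running at once.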