Commit d84bc10a authored by Pierre LINDENBAUM's avatar Pierre LINDENBAUM
Browse files

gff

parent 391c7791
......@@ -770,9 +770,95 @@ Le contenu d'un fichier VCF sera étudié en détail lors de la prochaine séanc
## Annotez le fichier VCF
Utilisez ENSEMBL Variant Effect Predictor https://rest.ensembl.org/documentation/info/vep_region_get pour annoter votre fichier VCF (conséquences sur la protéine, nom du gène, etc...). Choisissez la version du génome correspondant à votre génome de référence (GRCh38 ou GRCh37).
Nous allons utiliser bcftools csq pour annoter le VCF:
### Synopsis
```
About: Haplotype-aware consequence caller.
Usage: bcftools csq [options] in.vcf
Required options:
-f, --fasta-ref <file> reference file in fasta format
-g, --gff-annot <file> gff3 annotation file
CSQ options:
-b, --brief-predictions annotate with abbreviated protein-changing predictions
-c, --custom-tag <string> use this tag instead of the default BCSQ
-l, --local-csq localized predictions, consider only one VCF record at a time
-n, --ncsq <int> maximum number of consequences to consider per site [16]
-p, --phase <a|m|r|R|s> how to handle unphased heterozygous genotypes: [r]
a: take GTs as is, create haplotypes regardless of phase (0/1 -> 0|1)
m: merge *all* GTs into a single haplotype (0/1 -> 1, 1/2 -> 1)
r: require phased GTs, throw an error on unphased het GTs
R: create non-reference haplotypes if possible (0/1 -> 1|1, 1/2 -> 1|2)
s: skip unphased hets
Options:
-e, --exclude <expr> exclude sites for which the expression is true
--force run even if some sanity checks fail
-i, --include <expr> select sites for which the expression is true
--no-version do not append version and command line to the header
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v|t> b: compressed BCF, u: uncompressed BCF, z: compressed VCF
v: uncompressed VCF, t: plain tab-delimited text output [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --samples <-|list> samples to include or "-" to apply all variants and ignore samples
-S, --samples-file <file> samples to include
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
--threads <int> use multithreading with <int> worker threads [0]
-v, --verbose <int> verbosity level 0-2 [1]
Example:
bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf
# GFF3 annotation files can be downloaded from Ensembl. e.g. for human:
ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/
ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/
```
Téléchargez le fichier d'annotation gff pour le chromosome chr22:
Si vous travaillez sur hg38:
```
wget -O annot.gff.gz "ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/Homo_sapiens.GRCh38.101.chromosome.22.gff3.gz"
```
si vous travaillez sur hg19:
```
wget -O annot.gff.gz ftp://ftp.ensembl.org/pub/grch37/release-99/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.chromosome.22.gff3.gz"
```
Les fichiers GFF utilisent la notation '22' alors que nos chromosomes utilisent 'chr22'. Nous allons donc ajouter ce prefix avec `sed` pour les lignes qui commencent par '22'
```
gunzip -c annot.gff.gz | sed 's/^22/chr22/' | gzip > annotations.gff.gz
```
et nous annotons le fichier vcf avec le fichier de gènes GFF avec la command `bcftools csq`
```
bcftools csq --phase a --fasta-ref ref.fa --gff-annot annot.gff.gz -O b -o result.annot.vcf.gz result.vcf.gz
```
```
bcftools view result.annot.vcf.gz
(...)
chr22 22712467 . T G 730 . DP=332;VDB=0.784239;SGB=-2.07845;RPB=0.40
8006;MQB=0.451512;MQSB=0.413822;BQB=0.0195238;MQ0F=0.0180723;ICB=0.128205;HOB=0.0555556;AC=5;AN=6;DP4=9,3
0,108,129;MQ=51;BCSQ=missense|IGLV1-47|ENST00000390294|IG_V|+|70S>70R|22712467T>G GT:PL:DP:AD:GP:GQ
:BCSQ 1/1:255,255,0:107:2,105:269,259,0:127:3 0/1:255,0,255:81:36,45:265,0,250:127:2 1/1:255,245,0:88:
1,87:269,249,0:127:3
(...)
```
Si il vous reste du temps, cherchez dans le fichier VCF les variations de-novo chez l'enfant.
## Makefile
......
......@@ -16,7 +16,17 @@ $(1).sam : DATA/$(1).R1.fq.gz DATA/$(1).R2.fq.gz ref.fa.bwt
endef
all: result.vcf.gz.tbi
all: result.annot.vcf.gz.tbi
result.annot.vcf.gz.tbi: result.annot.vcf.gz
bcftools index $<
result.annot.vcf.gz: result.vcf.gz annot.gff.gz ref.fa.fai
bcftools csq --phase a --fasta-ref ref.fa --gff-annot annot.gff.gz -O b -o $@ $<
annot.gff.gz:
wget -O - "ftp://ftp.ensembl.org/pub/grch37/release-99/gff3/homo_sapiens/Homo_sapiens.GRCh37.87.chromosome.22.gff3.gz" | gunzip -c | sed 's/^22/chr22/' | gzip > $@
result.vcf.gz.tbi: result.vcf.gz
tabix --force --preset vcf $<
......@@ -25,7 +35,7 @@ result.vcf.gz: result.bcf
bcftools call --multiallelic-caller --variants-only --output-type z -o $@ --format-fields GQ,GP $<
result.bcf : $(addsuffix .bam.bai,child father mother) $(addsuffix .bam,child father mother) ref.fa.fai
bcftools mpileup -g --uncompressed --output-tags DP --fasta-ref ref.fa --output $@ $(filter %.bam,$^)
bcftools mpileup --fasta-ref ref.fa --output-type b --output $@ -a 'FORMAT/AD' -a 'FORMAT/DP' $(filter %.bam,$^)
$(eval $(call mapreads,father))
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment