ContaVect was developed to quantify and characterize DNA contaminants from gene therapy vector production after NGS sequencing. However, this automated pipeline can be used for any wider purpose that requires mapping NGS datasets consisting of a mix of DNA sequences onto multiple references. It combines several features, such as reference homology masking, fastq filtering and adapter trimming, short read alignment, SAM file splitting and generation of human-readable output.
## Principle
ContaVect is a Python pipeline composed of several modules linked together to analyze NGS data. The overall workflow is as follows:
1. Each reference fasta file is parsed to identify all the sequences within it, and a Reference object is initialized to store the reference characteristics, its name and the output required.
2. Optional: Homologies between references can be masked iteratively, starting with the last reference, which is masked by all the others, then the penultimate, which is masked by all the others except the last, and so forth until only one reference remains. This is done using blastn from the BLAST+ package.
3. Optional: Fastq reads can be filtered by mean quality and adapters can be trimmed using a homemade, fully integrated, parallel fastq filtering module written in Python and C.
4. If needed, a bwa index is generated from the modified reference files, or from the original ones, after they have been merged together in a temporary directory. Fastq sequences are then aligned against the merged reference index with bwa and a temporary SAM file is generated.
5. Aligned reads from the SAM file are split and attributed to the Reference object for which a hit was found, or to one of the following garbage read categories: unmapped, lowMapq, secondary.
6. Each reference then generates the output required in the configuration file (bam, sam, bedgraph, bed, covgraph and/or variant report).
7. Finally, distribution reports and a log file are generated.
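
The masking order described in step 2 can be illustrated with a minimal sketch. This is not ContaVect's actual code: the function name `masking_plan` and the reference names are hypothetical, and the blastn call itself is left out. It only shows which references would mask which, assuming the references are processed in the order they are listed.

``` python
def masking_plan(references):
    """Return (target, maskers) pairs in the order described in step 2.

    `references` is an ordered list of reference names (hypothetical input).
    The first reference is never masked, so a single reference gives an
    empty plan.
    """
    plan = []
    # Walk from the last reference down to the second one.
    for i in range(len(references) - 1, 0, -1):
        # The target is masked by every reference listed before it.
        plan.append((references[i], references[:i]))
    return plan


if __name__ == "__main__":
    # Illustrative names only; they do not come from the ContaVect sources.
    for target, maskers in masking_plan(["vector", "helper", "host"]):
        print("{} is masked by: {}".format(target, ", ".join(maskers)))
```

With this ordering, a region shared by two references is kept only in the one listed first, so the order of the references decides which one keeps the reads from a shared region.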


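The read attribution of step 5 can be sketched in the same spirit. The example below is an assumption-laden illustration, not the pipeline's implementation: it parses raw SAM text lines, uses the standard SAM flag bits for unmapped (0x4) and secondary (0x100) alignments, and applies a hypothetical `min_mapq` threshold of 30. The order of the checks and the use of the sequence name as the bin key are also assumptions.

``` python
from collections import defaultdict

UNMAPPED = 0x4      # SAM flag bit: read is unmapped
SECONDARY = 0x100   # SAM flag bit: secondary alignment


def classify(sam_line, min_mapq=30):
    """Return the bin for one SAM record: a reference name or a garbage category.

    `min_mapq` is a hypothetical threshold chosen for this sketch.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SECONDARY:
        return "secondary"
    if mapq < min_mapq:
        return "lowMapq"
    return rname


def split_sam(lines, min_mapq=30):
    """Group SAM records by bin, skipping header lines (which start with '@')."""
    bins = defaultdict(list)
    for line in lines:
        if line.startswith("@"):
            continue
        bins[classify(line, min_mapq)].append(line)
    return bins


if __name__ == "__main__":
    # Made-up records: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, then the
    # remaining mandatory SAM columns.
    records = [
        "r1\t0\tAAV\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r3\t256\tAAV\t20\t0\t4M\t*\t0\t0\tACGT\tIIII",
        "r4\t0\thost\t5\t3\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    for name, reads in sorted(split_sam(records).items()):
        print("{}: {} read(s)".format(name, len(reads)))
```

In the real pipeline a reference fasta file may hold several sequences, so a mapping from sequence name to Reference object would be needed before the reads can be attributed.
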
For more information, comprehensive developer documentation can be generated from ContaVect.dox using [Doxygen](https://github.com/doxygen/doxygen) with [doxypy](https://github.com/0xCAFEBABE/doxypy).
## Dependencies:
The program was developed under Linux Mint 16/17 and requires a Python 2.7 environment.
The following dependencies are required for proper program execution:
2. Enter the root of the program folder and make the main script executable
``` bash
$ sudo chmod u+x ContaVect.py
```
3. Compile the ssw aligner (and add the dynamic library to the path)
If you wish to perform an adapter trimming step before mapping, you need to compile the dynamic library ssw.so to be able to use the Smith-Waterman algorithm, forked from mengyao's [Complete-Striped-Smith-Waterman-Library](https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library).
To use the dynamic library libssw.so you may need to modify the LD_LIBRARY_PATH environment
variable to include the library directory (export LD_LIBRARY_PATH=$PWD) or for definitive