Eric CHARPENTIER · 7a838473
--- a/analysis/primary_analysis.md
+++ b/analysis/primary_analysis.md
+[*Back to home*](Home)  
+*Previous: [Testing the installation](usage/run_pipeline)*
+
 # Primary analysis

 The primary analysis is composed of the steps to go from the raw fastq files generated by the sequencing to the expression matrix containing the raw counts for each sample and for each gene.

-![fastq](uploads/9855695405d57446ffa0b7dc659ce0d3/fastq.png)
-
+![overview](uploads/9855695405d57446ffa0b7dc659ce0d3/fastq.png "Overview of the primary analysis pipeline")

 ## Demultipexing

+The demultiplexing step consists of reassigning each read in the raw fastq file to its sample according to the sample barcode. This step is achieved by a python script named `split_fastq.py`. This step consumes quite a lot of resources:

+ - reads coming from the raw fastq files are stored in **memory** before being dumped on every sample fastq files.
+ - there are as many **threads** created as the number of splitted raw fastq files. For example, if you have followed the procedure for splitting the raw fasts file as in [the inputs page](usage/reference#fastqFile), and you end up with 30 splitted pairs of fastq, then 30 parallel jobs will be created.

+Once all raw fastq files have been parsed and each read is stored in the good place in a python dictionnary, the reads are dumped in new fastq files (one per sample) in the **FASTQ** folder under the appropriate project folder. The sample barcode, as well as the UMI sequence, are incorporated in the name of each read.

+## Trimming

+A step of trimming all reads is performed in order to remove any poly A tails present (a sequence of at least 10 'A's). This can happen if the transposase has cut the DNA fragment to close to the poly A tail. By using cutadapt on these fragments, we also get rid of all the undesired bases after the poly A tails (reverse sequences of the UMI, sample barcode and illumina adapater). The newly generated fastq files are stored in the **CUTADAPT** folder. Old fastq files in the **FASTQ** folder are removed.

 ## Alignment

-## UMI counting
+The alignment step is performed by BWA. Since we're using a **reference transcriptome**, we don't need to use a RNA aligner (such as STAR or bowtie). We're using BWA with non standard arguments:
+
+```sh
+bwa aln -M 0 {ref} {fastq} | bwa samse -n {maxNumberTranscript} | samtools sort
+```
+
+ - The `-M` option of `bwa aln` ensures BWA will not search for suboptimal hits. We only want to keep the best hits.  
+ - The `-n` option of `bwa samse` will limit the maximum number of alignments to the maximum number of transcripts in a single gene. This reduces the size of the alignment files and ensures that reads mapping on multiple genes are taken into account. No need to print all the secondary alignments in the bam file. If there is more alignment location than the maximum of transcript for a single gene, then it means the read has aligned on multiple genes.
+
+Alignment files are stored in the **ALIGNMENT** folder. They are also indexed with `samtools index` if further analysis on the bam files is needed (e.g. IGV visualization).
+
+## Expression matrix
+
+Finally, the bam files are parsed in python script `count.py` with the help of the **pysam** module. The aim is to count each molecule of mRNA captured, with the help of the UMI's, for every gene in every sample, and thus, create an expression matrix. 
+
+Reads are removed if one of the following cases occur:
+ - its UMI contains N's in its sequence.
+ - there are more than 3 mismatch with the best alignement.
+ - if the sequence contains a poly A pattern (more than 10 'A's in a row). This shouldn't happen since the implementation of cutadapt.
+ - if aligned at too many position. This shouldn't happen since the use of the arguments in bwa.
+
+Multiple files are created from the execution of this script.  
+The name of the files start with the name of the project.  
+The second part is either "all" or "unq":
+ - "all" counts reads that have aligned on multiple genes. The read count is assigned to the gene defined as the primary alignement in the bam files. There is no other reason than random to assign the read to a particular transcript as primary alignment even though the secondary alignments are all viable solutions.
+ - "unq" counts only the reads which align on a single gene. Reads can align on multiple transcripts but of only one gene.
+
+The third part of the name defines the informations to be found:
+ - "log" is a summary of the counts for the whole project.
+ - "refseq" is the expression matrix. It is there that you will find all counts for each sample and for each gene.
+ - "spike" is the count matrix of the spike-in. It is there for legacy reasons and is not used in our implementation of the technique.
+ - "unknown_list" is the list of the tracks present in the fasta file (ie, the list of transcript) which are not assigned to a gene in the sym2ref file (ie, no annotation).
+ - "well_summary" is a summary of the counts for each sample. The number of read assigned to the sample, aligned, aligned on refseq sequences and the number of unique UMIs can be found in this file.
+
+ The fourth part of the name, only found in "refseq" and "spike" files can either be "total" or "umi":
+ - "total" is the number of total reads of the counts without taking into account the UMIs. Counts can be artificially increased by the PCR steps of the technique.
+ - "umi" is the number of unique molecules of RNA, this taking into account the UMIs for each read. If two reads of the same sample map to the same gene and have the same UMI, it will only be counted for 1.
+
+ **Thus, the expression matrix usually used for secondary analysis is "XXX.unq.refseq.umi.dat"**
+
+ ![expression matrix](../images/refseqUmi.png "Example of the expression matrix unq.refseq.umi.dat")
+
+All files resulting from this script, expression matrix, summaries etc... are stored in the **EXPRESSION** folder.

-# Secondary analysis
-## Filtering
\ No newline at end of file
+<div style="text-align: right">
+<i>Next: <a href="analysis/secondary_analysis">Secondary analysis</a></i>
+</div>
\ No newline at end of file