Eric CHARPENTIER · 5c418254
--- a/analysis/primary_analysis.md
+++ b/analysis/primary_analysis.md
+# Primary analysis
+
+The primary analysis covers the steps that lead from the raw fastq files produced by the sequencer to the expression matrix containing the raw counts for each sample and each gene.
+
+![fastq](uploads/9855695405d57446ffa0b7dc659ce0d3/fastq.png)
+
+## Cloning the git repository
+
+To use this analysis pipeline, you need to clone this repository. Note that the pipeline has been built to run on Linux systems and will probably not work on any other system.
+To clone this repository, **you need a working install of git**:
+
+```bash
+git clone "https://gitlab.univ-nantes.fr/bird_pipeline_registry/srp-pipeline.git"
+```
+
+All locations of files described in this wiki page will be assumed to be relative to the main "srp-pipeline" folder.
+
+## Creation of the conda virtual environment
+
+All tools necessary for the pipeline to run are listed within a **conda environment** recipe located in "CONDA/rna.yml", hence **you need a working install of conda**. If you do not have conda, you can get it from either:
+- the [miniconda](https://docs.conda.io/en/latest/miniconda.html) distribution
+- the [anaconda](https://www.anaconda.com/distribution/) distribution
+
+Because conda cannot manage the dependencies of all the R packages needed, the conda environment only installs R itself, not the packages. The R packages are installed afterwards, via Bioconductor, by an R script located in "CONDA/installDeEnv.R".  
+You can either create the conda environment manually and then run the R script to install the R packages, or simply use the bash script "install_dependencies.sh":
+
+```bash
+./install_dependencies.sh
+```
+
+## Creation of the input files
+
+In order to run the pipeline, you will need two things:
+- a **samplesheet** describing your samples
+- a pair of **raw fastq files**
+
+### The samplesheet
+
+This file has to be a *tab-delimited* file (without header) containing six columns:
+
+1. well : the well of the plate the sample was prepared in
+2. index : the corresponding barcode associated with the well
+3. name : the name of the sample (only alpha-numeric characters allowed)
+4. project : the project name. A single plate can be composed of multiple projects (only alpha-numeric characters allowed)
+5. condition : the condition the sample belongs to. Necessary to perform comparisons to find differentially expressed genes (only alpha-numeric characters allowed)
+6. species : the genome (only one genome per project), e.g. hg19, hg38, mm10, rn6
+
+:page_facing_up: Create a samplesheet
+
+| | | | | | |
+| :--- | :---- | :---- | :--- | :---- | :---- |
+| A01  | AAAACT | KR_1 | ProjectX | ctl | hg19 |
+| A02  | AAAGTT | KR_2 | ProjectX | ctl | hg19 |
+| A03  | AAATTG | KR_3 | ProjectX | case | hg19 |
+| ...  | ...    | ...  | ... | ... | ... |
+| H12  | TGGCCG | SK_1 | ProjectY | conditionX | rn6 |
+
+> **Note:**
+>
+> - The file should be **tab-delimited**, without trailing or leading spaces.
+> - Only the characters **a-z**, **A-Z**, **-** and **_** are allowed.
+> - There should be **no empty lines**.
+> - There should be **no header**.
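+
+The rules above can be checked before launching the pipeline. The function below is only a sketch: `check_samplesheet` is a hypothetical helper, not a script shipped with the pipeline. Since the example rows contain digits (`A01`, `KR_1`, `hg19`), it assumes that digits are accepted in addition to the characters listed in the note.
+
+```bash
+# Sketch of a samplesheet sanity check (hypothetical helper, not part of the pipeline).
+# Usage: check_samplesheet samplesheet.tsv
+# Prints one message per problem and returns a non-zero status if any was found.
+check_samplesheet() {
+    awk -F '\t' '
+        # Empty lines are not allowed.
+        /^$/ { printf "line %d: empty line\n", NR; bad = 1; next }
+        # Exactly six tab-separated columns are expected.
+        NF != 6 { printf "line %d: expected 6 columns, found %d\n", NR, NF; bad = 1; next }
+        {
+            for (i = 1; i <= NF; i++) {
+                if ($i ~ /^ / || $i ~ / $/) {
+                    # No leading or trailing spaces in any field.
+                    printf "line %d, column %d: leading or trailing space\n", NR, i; bad = 1
+                } else if ($i !~ /^[A-Za-z0-9_-]+$/) {
+                    # Assumed character set: letters, digits, "-" and "_".
+                    printf "line %d, column %d: invalid value \"%s\"\n", NR, i, $i; bad = 1
+                }
+            }
+        }
+        END { exit bad + 0 }
+    ' "$1"
+}
+```
+
+A header line cannot be told apart from a sample line by this check, so it still has to be removed by hand.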
+
+#### Filling in the samplesheet correctly
+
+##### Wells and barcodes
+
+The first two columns are set by the experimental protocol used.  
+A file describing the wells and their associated barcodes, as used on the GenoBiRD platform, is located in "SCRIPTS/barcode96.tsv".
+
+##### Sample names
+
+The sample names should only contain the characters **a-z**, **A-Z**, **-** and **_**. No other characters are allowed. They should be unique within a project.
+
+##### Project
+
+A plate can be composed of multiple projects. Different projects will be analysed separately, in different folders. Separating studies into different projects affects sample normalization and the discovery of differentially expressed genes. The project names should only contain the characters **a-z**, **A-Z**, **-** and **_**.
+
+##### Conditions
+
+The conditions are the groups of samples that are compared to find the differentially expressed genes. Each condition name must be unique within a project, and a condition needs at least two samples to be compared with other conditions. The condition names should only contain the characters **a-z**, **A-Z**, **-** and **_**, and must start with a letter.
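+
+The "at least two samples" rule is easy to overlook on a full plate. As a sketch, the hypothetical helper below (`small_conditions`, not part of the pipeline) lists every project/condition pair of a samplesheet that has fewer than two samples, using columns 4 (project) and 5 (condition) as described above.
+
+```bash
+# List the (project, condition) pairs that contain fewer than two samples.
+# Output columns: project, condition, number of samples.
+# Usage: small_conditions samplesheet.tsv
+small_conditions() {
+    awk -F '\t' '
+        # Count the samples of each condition, separately for each project.
+        { n[$4 "\t" $5]++ }
+        END { for (k in n) if (n[k] < 2) print k "\t" n[k] }
+    ' "$1" | sort
+}
+```
+
+An empty output means that every condition can be compared with the others of its project.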
+
+##### Species
+
+Every reference transcriptome specified in the "species" column will be downloaded if the reference folder you specify does not already contain it. Three files are needed to build a reference:
+ - refMrna containing all the refseq sequences
+ - chrM genomic sequence
+ - refGene containing all the refseq annotations
+
+These references are downloaded from the UCSC download server. The URLs are built programmatically, so the following files must exist:
+ - `http://hgdownload.cse.ucsc.edu/goldenPath/<species>/chromosomes/chrM.fa.gz`
+ - `http://hgdownload.cse.ucsc.edu/goldenPath/<species>/database/refGene.txt.gz`
+ - `http://hgdownload.cse.ucsc.edu/goldenPath/<species>/bigZips/refMrna.fa.gz`
+
+with \<species\> being the genome specified in the samplesheet.  
+If these files do not exist, you will have to build the reference yourself.  
+A correct reference must contain a fasta file with the transcript sequences and a file (that we call "_sym2ref.dat") linking the name of each sequence in the fasta file to a gene name. Examples of these files are provided in the "TESTDATA/REFERENCES/hg19" folder.
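+
+Before running the pipeline on a new genome, the availability of the three files listed above can be checked with a sketch like the one below. Both functions are hypothetical helpers, not part of the pipeline; they only rebuild the URL patterns given above and assume that `curl` is installed.
+
+```bash
+# Print the three UCSC URLs expected for a given genome (e.g. hg19).
+reference_urls() {
+    local species=$1
+    local base="http://hgdownload.cse.ucsc.edu/goldenPath/${species}"
+    printf '%s\n' \
+        "${base}/chromosomes/chrM.fa.gz" \
+        "${base}/database/refGene.txt.gz" \
+        "${base}/bigZips/refMrna.fa.gz"
+}
+
+# Report which of these files are reachable (requires network access).
+check_reference() {
+    local url
+    reference_urls "$1" | while read -r url; do
+        # --head only requests the headers, --fail turns HTTP errors into a non-zero status.
+        if curl --silent --fail --head --location --output /dev/null "$url"; then
+            echo "OK       $url"
+        else
+            echo "MISSING  $url"
+        fi
+    done
+}
+```
+
+For example, `check_reference rn6` prints one line per file. Any `MISSING` line means the reference for that genome has to be built by hand.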
+
+### The fastq files
+
+The sequencing machine generates a single pair of raw fastq files, R1 and R2.
+- R1 contains the sequences for the sample barcode and for the unique molecule identifiers (UMIs).
+- R2 contains the sequences of the transcripts captured, each one tagged by a UMI.
+
+![lib](uploads/8e3ff62214d790d3268b67eabc179c61/lib.png)
+
+The first step consists in reassigning every transcript sequence contained in the R2 file to the sample it belongs to, by reading the first 6 bases of the corresponding sequence in the R1 file.  
+To speed this up, since those files contain approximately 500M reads, the original pair of raw fastq files is split into chunks of 4M reads (16M lines). This is performed by the Makefile provided in the "SCRIPTS" directory of the project:
+
+```bash
+make -f SCRIPTS/Makefile -j 2 R1=/path/to/R1.fastq.gz R2=/path/to/R2.fastq.gz NJOBS=20 OUTDIR=/path/to/split/fastqFiles
+```
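+
+To make the two operations described above more concrete, here is a minimal sketch of what they amount to. It is not the content of the Makefile: `split_fastq` and `r1_barcodes` are hypothetical helpers, the `chunk_` file names are arbitrary, and the `split` options used are those of GNU coreutils.
+
+```bash
+# Split a gzipped fastq file into chunks of a fixed number of reads.
+# A fastq record spans 4 lines, so 4M reads correspond to 16M lines.
+# Usage: split_fastq file.fastq.gz outdir [reads_per_chunk]
+split_fastq() {
+    local fastq=$1 outdir=$2 reads_per_chunk=${3:-4000000}
+    mkdir -p "$outdir"
+    # Chunks are written as chunk_0000.fastq, chunk_0001.fastq, ...
+    gzip -dc "$fastq" |
+        split -l "$((reads_per_chunk * 4))" -d -a 4 \
+              --additional-suffix=.fastq - "$outdir/chunk_"
+}
+
+# Print the sample barcode (first 6 bases) of every read of an R1 file.
+# The sequence is the second line of each 4-line fastq record.
+r1_barcodes() {
+    gzip -dc "$1" | awk 'NR % 4 == 2 { print substr($0, 1, 6) }'
+}
+```
+
+Assuming R1 and R2 list the same reads in the same order, splitting both with the same chunk size keeps the chunks in sync: the n-th chunk of R1 holds the barcodes of the reads in the n-th chunk of R2.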
+
+## Demultiplexing
+
+## Alignment
+
+## UMI counting
+
+# Secondary analysis
+## Filtering
\ No newline at end of file