redo trimming
I’m re-doing the trimming on the Salmon BS data with 10bp clipped off each end of the read based on bismark recommendations
excerpt from code:
# Salmon samples
reads_dir="/gscratch/srlab/strigg/data/Ssalar/FASTQS/RAW/"
find ${reads_dir}*_R1_001.fastq.gz | \
xargs basename -s _R1_001.fastq.gz | \
xargs -I{} /gscratch/srlab/strigg/bin/anaconda3/bin/trim_galore \
--cores 8 \
--output_dir /gscratch/scrubbed/strigg/analyses/20200427/TG_PE_FASTQS \
--paired \
--fastqc_args \
"--outdir /gscratch/scrubbed/strigg/analyses/20200427/TG_PE_FASTQS/FastQC \
--threads 28" \
--clip_R1 10 \
--clip_R2 10 \
--three_prime_clip_R1 10 \
--three_prime_clip_R2 10 \
--path_to_cutadapt /gscratch/srlab/strigg/bin/anaconda3/bin/cutadapt \
${reads_dir}{}_R1_001.fastq.gz \
${reads_dir}{}_R2_001.fastq.gz
- script: 20200427_TG\2_SalmoCalig.sh
- slurm file: slurm-2535023.out
- output: https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20200427/TG_PE_FASTQS/
Concatenate FQs from different lanes
- script: 20200612_CatTGlanes_Salmo.sh
- slurm file: slurm-2706792.out
- output: https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20200612/
alignment score thresholding
I initially tested alignment score thresholds:
- L,0,-0.2
- L,0,-0.6
- L,0,-1
- L,0,-1.2
- L,0,-1 with insert size min of 60
-
L,0,-2
- Summary here: BmrkAlignementScoreComparison.html
- journal post here: https://shellywanamaker.github.io/146th-post/
Because L,0,-0.6 shows ~ 4% CHH methyation, which means ~96% conversion efficiency which is a low estimate (normally it should be 99%), I am now interested in testing:
- L,0,-0.3
- L,0,-0.4
- L,0,-0.5
I was interested to try both single and paired end alignment at these more stringent thresholds on a subset of reads (100000) for a few samples (randomly selected 16C_32psu_4_S4, 16C_26psu_1_S13, CTRL_8C_26psu_1_S17, 8C_32psu_3_S7, and 8C_26psu_2_S10)
Analysis
The following two analyses were performed on mox. Bismark was run on contatenated newly trimmed BS data using alignment score thresholds L,0,-0.2; L,0,-0.3; L,0,-0.4; L,0,-0.5; and L,0,-0.6
Paired end alignments
- script: 20200612_BmrkAlnAStest_Salmo.sh
- slurm file: slurm-2706798.out
- output: https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20200612/AS_0.2/
Single end alignments
- script: 20200612_BmrkAlnAStest_Salmo_SE.sh
- slurm file: slurm-2706898.out
- output:
Results
- R Script: 20200612_AS_compare.Rmd
- input data: I created the following data files from the bismark reports in the output folders (e.g. 8C_26psu_2_S10_R1_001_val_1_bismark_bt2_PE_report.txt):
In choosing an alignment score, I want to balance the highest % mapping with the lowest % CHH and % CHG methylation, and the highest % mCpG methylation.
Single end alignments tend give higher % mapping than paired end. But this comes at the cost of increased % mCHH and decreased % mCpG which suggest SE mapping is likely not contributing to high integrity alignments. Because the only CHH methylation we are expecting is from incomplete bisulfite conversion, which is typically <= 1%. This is consistent with my previous single end alignment analysis).
Notice in the % mCHH plot how the boxplots tend to show an exponential increase in % methylation. Alignment score threshold of L,0,-0.3 seems to be the last point before the increase shoots up while still staying close to the 99% conversion efficiency rate that’s typically expected.
Conclusions
- I will use paired end alignment mode with an alignment score threshold of L,0,-0.3 to maximize % mapping and % mCpG methylation, and minimize % CHH methylation.