Jun 12, 2020 Re-process Salmon Methylome Data

redo trimming

I’m re-doing the trimming on the Salmon BS data with 10bp clipped off each end of the read based on bismark recommendations

excerpt from code:

# Salmon samples
reads_dir="/gscratch/srlab/strigg/data/Ssalar/FASTQS/RAW/"

find ${reads_dir}*_R1_001.fastq.gz | \
xargs basename -s _R1_001.fastq.gz | \
xargs -I{} /gscratch/srlab/strigg/bin/anaconda3/bin/trim_galore \
--cores 8 \
--output_dir /gscratch/scrubbed/strigg/analyses/20200427/TG_PE_FASTQS \
--paired \
--fastqc_args \
"--outdir /gscratch/scrubbed/strigg/analyses/20200427/TG_PE_FASTQS/FastQC \
--threads 28" \
--clip_R1 10 \
--clip_R2 10 \
--three_prime_clip_R1 10 \
--three_prime_clip_R2 10 \
--path_to_cutadapt /gscratch/srlab/strigg/bin/anaconda3/bin/cutadapt \
${reads_dir}{}_R1_001.fastq.gz \
${reads_dir}{}_R2_001.fastq.gz 

script: 20200427_TG\2_SalmoCalig.sh
slurm file: slurm-2535023.out
output: https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20200427/TG_PE_FASTQS/

Concatenate FQs from different lanes

script: 20200612_CatTGlanes_Salmo.sh
slurm file: slurm-2706792.out
output: https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20200612/

alignment score thresholding

I initially tested alignment score thresholds:

L,0,-0.2
L,0,-0.6
L,0,-1
L,0,-1.2
L,0,-1 with insert size min of 60
L,0,-2
Summary here: BmrkAlignementScoreComparison.html
journal post here: https://shellywanamaker.github.io/146th-post/

Because L,0,-0.6 shows ~ 4% CHH methyation, which means ~96% conversion efficiency which is a low estimate (normally it should be 99%), I am now interested in testing:

L,0,-0.3
L,0,-0.4
L,0,-0.5

I was interested to try both single and paired end alignment at these more stringent thresholds on a subset of reads (100000) for a few samples (randomly selected 16C_32psu_4_S4, 16C_26psu_1_S13, CTRL_8C_26psu_1_S17, 8C_32psu_3_S7, and 8C_26psu_2_S10)

Analysis

The following two analyses were performed on mox. Bismark was run on contatenated newly trimmed BS data using alignment score thresholds L,0,-0.2; L,0,-0.3; L,0,-0.4; L,0,-0.5; and L,0,-0.6

Paired end alignments

script: 20200612_BmrkAlnAStest_Salmo.sh
slurm file: slurm-2706798.out
output: https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20200612/AS_0.2/

Single end alignments

script: 20200612_BmrkAlnAStest_Salmo_SE.sh
slurm file: slurm-2706898.out
output:
- AS_0.2
- AS_0.3
- AS_0.4
- AS_0.5
- AS_0.6

Results

R Script: 20200612_AS_compare.Rmd
input data: I created the following data files from the bismark reports in the output folders (e.g. 8C_26psu_2_S10_R1_001_val_1_bismark_bt2_PE_report.txt):
- 20200612_AS_compare.csv
- 20200612_AS_compare_SE.csv

In choosing an alignment score, I want to balance the highest % mapping with the lowest % CHH and % CHG methylation, and the highest % mCpG methylation.

Single end alignments tend give higher % mapping than paired end. But this comes at the cost of increased % mCHH and decreased % mCpG which suggest SE mapping is likely not contributing to high integrity alignments. Because the only CHH methylation we are expecting is from incomplete bisulfite conversion, which is typically <= 1%. This is consistent with my previous single end alignment analysis).

Notice in the % mCHH plot how the boxplots tend to show an exponential increase in % methylation. Alignment score threshold of L,0,-0.3 seems to be the last point before the increase shoots up while still staying close to the 99% conversion efficiency rate that’s typically expected.

Conclusions

I will use paired end alignment mode with an alignment score threshold of L,0,-0.3 to maximize % mapping and % mCpG methylation, and minimize % CHH methylation.