redo trimming

I’m re-doing the trimming on the Salmon BS data with 10bp clipped off each end of the read based on bismark recommendations

excerpt from code:

# Salmon samples
reads_dir="/gscratch/srlab/strigg/data/Ssalar/FASTQS/RAW/"

find ${reads_dir}*_R1_001.fastq.gz | \
xargs basename -s _R1_001.fastq.gz | \
xargs -I{} /gscratch/srlab/strigg/bin/anaconda3/bin/trim_galore \
--cores 8 \
--output_dir /gscratch/scrubbed/strigg/analyses/20200427/TG_PE_FASTQS \
--paired \
--fastqc_args \
"--outdir /gscratch/scrubbed/strigg/analyses/20200427/TG_PE_FASTQS/FastQC \
--threads 28" \
--clip_R1 10 \
--clip_R2 10 \
--three_prime_clip_R1 10 \
--three_prime_clip_R2 10 \
--path_to_cutadapt /gscratch/srlab/strigg/bin/anaconda3/bin/cutadapt \
${reads_dir}{}_R1_001.fastq.gz \
${reads_dir}{}_R2_001.fastq.gz 

Concatenate FQs from different lanes

alignment score thresholding

I initially tested alignment score thresholds:

Because L,0,-0.6 shows ~ 4% CHH methyation, which means ~96% conversion efficiency which is a low estimate (normally it should be 99%), I am now interested in testing:

  • L,0,-0.3
  • L,0,-0.4
  • L,0,-0.5

I was interested to try both single and paired end alignment at these more stringent thresholds on a subset of reads (100000) for a few samples (randomly selected 16C_32psu_4_S4, 16C_26psu_1_S13, CTRL_8C_26psu_1_S17, 8C_32psu_3_S7, and 8C_26psu_2_S10)

Analysis

The following two analyses were performed on mox. Bismark was run on contatenated newly trimmed BS data using alignment score thresholds L,0,-0.2; L,0,-0.3; L,0,-0.4; L,0,-0.5; and L,0,-0.6

Paired end alignments

Single end alignments

Results

In choosing an alignment score, I want to balance the highest % mapping with the lowest % CHH and % CHG methylation, and the highest % mCpG methylation.

Single end alignments tend give higher % mapping than paired end. But this comes at the cost of increased % mCHH and decreased % mCpG which suggest SE mapping is likely not contributing to high integrity alignments. Because the only CHH methylation we are expecting is from incomplete bisulfite conversion, which is typically <= 1%. This is consistent with my previous single end alignment analysis).

Notice in the % mCHH plot how the boxplots tend to show an exponential increase in % methylation. Alignment score threshold of L,0,-0.3 seems to be the last point before the increase shoots up while still staying close to the 99% conversion efficiency rate that’s typically expected.

Conclusions

  • I will use paired end alignment mode with an alignment score threshold of L,0,-0.3 to maximize % mapping and % mCpG methylation, and minimize % CHH methylation.