Methylation analysis:

Geoduck alignments too slow, should be directional, and many chromosomal sequence extraction errors

  • increase processors/cores to use from default(4) to 28 (i.e. -p 28)

  • change to directional mapping because non-directional mapping should be 1:1:1:1 But instead the non-directional mapping shows ~ 10:1:1:10 as follows:
    CT/GA/CT: 2758 ((converted) top strand)
    GA/CT/CT: 350 (complementary to (converted) top strand)
    GA/CT/GA: 316 (complementary to (converted) bottom strand)
    CT/GA/GA: 2725 ((converted) bottom strand)

  • padded genome
    • strigg/analyses/20181025/slurm-401715.out shows > 1M errors: ‘Chromosomal sequence could not be extracted’
    • creating padded genome to see if this improves mapping
    • Steven can run scripts from coenv : sbatch -p coenv -A coenv 20181101_bmrkPgenrGenmPadded.sh *needed to make my directory writable (chmod 777)
  • tested effects of directional mapping and genome padding on mapping efficiency with one sample (103)
    • jupyter notebook
    • mapping reports to compare:
      • original alignment (non-directional mapping, aligned to non-padded genome)
        • /Volumes/web/metacarcinus/Pgenerosa/20181025/EPI-103_S27_L005_R1_001_val_1_bismark_bt2_PE_report.txt

            Bismark was run with Bowtie 2 against the bisulfite genome of /gscratch/srlab/strigg/data/Pgenr/ with the specified options: -q --score-min L,0,-1.2 --ignore-quals --no-mixed --no-discordant --dovetail --minins 60 --maxins 500
            Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)
          
            Final Alignment report
            ======================
            Sequence pairs analysed in total:	20349659
            Number of paired-end alignments with a unique best hit:	13364383
            Mapping efficiency:	65.7% 
            Sequence pairs with no alignments under any condition:	3385894
            Sequence pairs did not map uniquely:	3599382
            Sequence pairs which were discarded because genomic sequence could not be extracted:	45055
          
            Number of sequence pairs with unique best (first) alignment came from the bowtie output:
            CT/GA/CT:	6344954	((converted) top strand)
            GA/CT/CT:	288878	(complementary to (converted) top strand)
            GA/CT/GA:	294928	(complementary to (converted) bottom strand)
            CT/GA/GA:	6390568	((converted) bottom strand)
          
            Final Cytosine Methylation Report
            =================================
            Total number of C's analysed:	436619499
          
            Total methylated C's in CpG context:	12108570
            Total methylated C's in CHG context:	2075398
            Total methylated C's in CHH context:	14520988
            Total methylated C's in Unknown context:	485587
          
          
            Total unmethylated C's in CpG context:	48927920
            Total unmethylated C's in CHG context:	89229392
            Total unmethylated C's in CHH context:	269757231
            Total unmethylated C's in Unknown context:	2195501
          
          
            C methylated in CpG context:	19.8%
            C methylated in CHG context:	2.3%
            C methylated in CHH context:	5.1%
            C methylated in unknown context (CN or CHN):	18.1%
          
          
            Bismark completed in 0d 6h 36m 43s
          
      • new alignment (directional mapping, aligned to padded genome) -/Volumes/web/metacarcinus/Pgenerosa/20181101/EPI-103_S27_L005_R1_001_val_1_bismark_bt2_PE_report.txt

          Bismark report for: /Volumes/web-1/Athaliana/20180516_geoduck_trimgalore_rrbs/EPI-103_S27_L005_R1_001_val_1.fq.gz and /Volumes/web-1/Athaliana/20180516_geoduck_trimgalore_rrbs/EPI-103_S27_L005_R2_001_val_2.fq.gz (version: v0.20.0)
          Bismark was run with Bowtie 2 against the bisulfite genome of /Volumes/web/metacarcinus/Pgenerosa/GENOMES/v070/padded/ with the specified options: -q --score-min L,0,-1.2 -p 8 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --minins 60 --maxins 500
          Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)
        
          Final Alignment report
          ======================
          Sequence pairs analysed in total:	20349659
          Number of paired-end alignments with a unique best hit:	12983519
          Mapping efficiency:	63.8% 
          Sequence pairs with no alignments under any condition:	3888896
          Sequence pairs did not map uniquely:	3477244
          Sequence pairs which were discarded because genomic sequence could not be extracted:	53095
        
          Number of sequence pairs with unique best (first) alignment came from the bowtie output:
          CT/GA/CT:	6437602	((converted) top strand)
          GA/CT/CT:	0	(complementary to (converted) top strand)
          GA/CT/GA:	0	(complementary to (converted) bottom strand)
          CT/GA/GA:	6492822	((converted) bottom strand)
        
          Number of alignments to (merely theoretical) complementary strands being rejected in total:	0
        
          Final Cytosine Methylation Report
          =================================
          Total number of C's analysed:	419468259
        
          Total methylated C's in CpG context:	11961463
          Total methylated C's in CHG context:	1856434
          Total methylated C's in CHH context:	4405042
          Total methylated C's in Unknown context:	133308
        
        
          Total unmethylated C's in CpG context:	48736712
          Total unmethylated C's in CHG context:	89082405
          Total unmethylated C's in CHH context:	263426203
          Total unmethylated C's in Unknown context:	1862416
        
        
          C methylated in CpG context:	19.7%
          C methylated in CHG context:	2.0%
          C methylated in CHH context:	1.6%
          C methylated in unknown context (CN or CHN):	6.7%
        
        
          Bismark completed in 0d 6h 30m 37s
          Shellys-MacBook-Pro:20181025 Shelly$ cd ..
          Shellys-MacBook-Pro:Pgenerosa Shelly$ ls
          20181011	20181025	20181101	BLAST_20181101	GENOMES
          Shellys-MacBook-Pro:Pgenerosa Shelly$ cd 20181101/
          Shellys-MacBook-Pro:20181101 Shelly$ ls
          EPI-103_S27_L005_R1_001_val_1_bismark_bt2_PE_report.txt	EPI-103_S27_L005_R1_001_val_1_bismark_bt2_pe.bam
          Shellys-MacBook-Pro:20181101 Shelly$ cat EPI-103_S27_L005_R1_001_val_1_bismark_bt2_PE_report.txt 
          Bismark report for: /Volumes/web-1/Athaliana/20180516_geoduck_trimgalore_rrbs/EPI-103_S27_L005_R1_001_val_1.fq.gz and /Volumes/web-1/Athaliana/20180516_geoduck_trimgalore_rrbs/EPI-103_S27_L005_R2_001_val_2.fq.gz (version: v0.20.0)
          Bismark was run with Bowtie 2 against the bisulfite genome of /Volumes/web/metacarcinus/Pgenerosa/GENOMES/v070/padded/ with the specified options: -q --score-min L,0,-1.2 -p 8 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --minins 60 --maxins 500
          Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)
        
          Final Alignment report
          ======================
          Sequence pairs analysed in total:	20349659
          Number of paired-end alignments with a unique best hit:	12983519
          Mapping efficiency:	63.8% 
          Sequence pairs with no alignments under any condition:	3888896
          Sequence pairs did not map uniquely:	3477244
          Sequence pairs which were discarded because genomic sequence could not be extracted:	53095
        
          Number of sequence pairs with unique best (first) alignment came from the bowtie output:
          CT/GA/CT:	6437602	((converted) top strand)
          GA/CT/CT:	0	(complementary to (converted) top strand)
          GA/CT/GA:	0	(complementary to (converted) bottom strand)
          CT/GA/GA:	6492822	((converted) bottom strand)
        
          Number of alignments to (merely theoretical) complementary strands being rejected in total:	0
        
          Final Cytosine Methylation Report
          =================================
          Total number of C's analysed:	419468259
        
          Total methylated C's in CpG context:	11961463
          Total methylated C's in CHG context:	1856434
          Total methylated C's in CHH context:	4405042	
          Total methylated C's in Unknown context:	133308
        
        
          Total unmethylated C's in CpG context:	48736712
          Total unmethylated C's in CHG context:	89082405
          Total unmethylated C's in CHH context:	263426203
          Total unmethylated C's in Unknown context:	1862416
        
        
          C methylated in CpG context:	19.7%
          C methylated in CHG context:	2.0%
          C methylated in CHH context:	1.6%
          C methylated in unknown context (CN or CHN):	6.7%
        
        
          Bismark completed in 0d 6h 30m 37s
        

CONCLUSION:

  • DIRECTIONAL MAPPING AND PADDED GENOME LED TO SLIGHTLY LOWER MAPPING EFFICIENCY
  • PADDED GENOME DID NOT IMPROVE THE NUMBER OF SEQUENCES THAT COULD BE EXTRACTED
  • submitted job 417196 with 28 processors, with directional mapping, and aligning to the original genome (/Volumes/web/metacarcinus/Pgenerosa/GENOMES/v070/Pgenerosa_v070.fa)