Background

This post is related to github issue #30

The intent is to examine genetic related across the samples and to get a mild sense of what impact genetic variants might have on methylation and expression. Mild because variants are being predicted from expression and methylation data as opposed to directly being measured in WGS data.

Ideas for SNV analysis:

Methods

EM-seq

I’m going to try Biscuit based on the Sun et al. (2023) Epigenomics paper linked above.

install biscuit

conda install -c bioconda biscuit

conda install bioconda::bedtools

conda install bioconda::samtools

Run biscuit on subset of data

# initiate screen session
screen -S biscuit 
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=50GB --time=8:00:00

# Create index of the reference genome (only needs to be run once for each reference)
# Gzipped FASTA references can also be used

biscuit index /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta

#create a subset of reads from one bam file
# from here: https://bioinformatics.stackexchange.com/questions/3565/subset-smaller-bam-to-contain-several-thousand-rows-from-multiple-chromosomes


samtools view -bo POC-201-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-201-TP1.deduplicated.sorted.bam 

#count number of lines in subsetted data
samtools view POC-201-TP1.deduplicated.sorted_subset.bam | wc -l
48074

#need to create an index because it is not sorted anymore
samtools index POC-201-TP1.deduplicated.sorted_subset.bam 


# Create a pileup VCF of DNA methylation and genetic information
biscuit pileup -@ 1 -o my_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
/gscratch/scrubbed/strigg/analyses/20250506_biscuit/POC-201-TP1.deduplicated.sorted_subset.bam

# Extract DNA methylation into BED format
biscuit vcf2bed -t snp my_pileup.vcf> my_mutation_data.bed

The columns for -t snp are:

  1. Chromosome
  2. Start position (0-based)
  3. End position
  4. Reference base
  5. Alternate base
  6. Genotype (GT tag)
  7. Base support (SP tag)
  8. Coverage (AC tag)
  9. Allele frequency (AF1 tag)

Preview of mutation data

(base) [strigg@n3439 20250506_biscuit]$ head my_mutation_data.bed 
Pocillopora_meandrina_HIv1___Sc0000000  29112   29113   A       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  33255   33256   A       N       1/1     Y1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  33365   33366   A       N       1/1     Y1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50181   50182   A       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50205   50206   T       C       1/1     C1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50348   50349   T       N       1/1     R1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50366   50367   T       N       1/1     R1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  51334   51335   G       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  62385   62386   A       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  101459  101460  C       T       1/1     T1      1       1.00

I also wanted to see how the function for multiple samples worked

samtools index POC-40-TP1.deduplicated.sorted_subset.bam

samtools view -bo POC-40-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-40-TP1.deduplicated.sorted.bam

#run biscuit pileup on multiple samples
biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam 

I wanted to see if the samples are still distinguishable in the output so I ran grep POC combined_pileup.vcf and that returned:

##program=<cmd=biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  POC-201-TP1.deduplicated.sorted_subsetPOC-40-TP1.deduplicated.sorted_subset

So it appears in the vcf file the data from different samples are kept seperate in different columns.

I wanted to see if this held true in the bedfile since the vcf file is not easy to work with. I ran biscuit vcf2bed -t snp combined_pileup.vcf | less and didn’t see the sample names when I scanned through the file. I tried biscuit vcf2bed -t snp combined_pileup.vcf | grep POC and this returned nothing so the bed file doesn’t not keep sample data separate.

This made me opt for writing a script to loop through each deduplicated.bam file and create a vcf and bed file for each.

Run Biscuit for multiple samples

SLURM script here: 20250507_biscuit.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20250507_biscuit
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=250G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job 
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250507_biscuit

set -ex

#activate conda env

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh" ]; then
        . "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh"
    else
        export PATH="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin:$PATH"
    fi
fi
unset __conda_setup

if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh" ]; then
    . "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh"
fi
# <<< conda initialize <<<

#define directory 

bismark_dir="/gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/"

biscuit_dir="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/"

# Create a pileup VCF of genetic information for each sample

find ${bismark_dir}*.deduplicated.sorted.bam| \
xargs basename -s .deduplicated.sorted.bam |\
xargs -I{} ${biscuit_dir}biscuit pileup -@ 24 -o {}_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
${bismark_dir}{}.deduplicated.sorted.bam

# Also compresses and indexes the VCF
bgzip -@ 24 *.vcf
tabix -p vcf *.vcf.gz


# Extract DNA methylation into BED format

find *.vcf.gz |\
xargs basename -s _pileup.vcf.gz |\
xargs -I{} ${biscuit_dir}biscuit vcf2bed -t snp {}_pileup.vcf.gz > {}_mutation_data.bed

# Also compresses and indexes the BED
bgzip *_mutation_data.bed
tabix -p bed *_mutation_data.bed.gz



RNAvar

Run nf-core rnavar pipeline

# initiate screen session
screen -S rnavar

#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00

#activate nextflow environment
mamba activate nextflow

#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results

if I am supplying a fasta and a gtf using the –fasta and –gtf parameters and not using the –genome parameter, why is the pipeline still defaulting to GRCh38?

trying this as recommended by slack

#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00

#activate nextflow environment
mamba activate nextflow

#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--genome null \
--igenomes_ignore \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results

Results