Identify variants in RNAseq and EM-seq data from P.tuahiniensis

Background

This post is related to github issue #30

The intent is to examine genetic related across the samples and to get a mild sense of what impact genetic variants might have on methylation and expression. Mild because variants are being predicted from expression and methylation data as opposed to directly being measured in WGS data.

Ideas for SNV analysis:

RNAseq:
- nf-core rnavar
EM-seq:
- Comparison of SNV analysis methods on methylation data https://pmc.ncbi.nlm.nih.gov/articles/PMC10072131/
- Biscuit: https://huishenlab.github.io/biscuit/
- BS-SNPer: https://github.com/hellbelly/BS-Snper

Methods

EM-seq

I’m going to try Biscuit based on the Sun et al. (2023) Epigenomics paper linked above.

install biscuit

conda install -c bioconda biscuit

conda install bioconda::bedtools

conda install bioconda::samtools

Run biscuit on subset of data

# initiate screen session
screen -S biscuit 
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=50GB --time=8:00:00

# Create index of the reference genome (only needs to be run once for each reference)
# Gzipped FASTA references can also be used

biscuit index /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta

#create a subset of reads from one bam file
# from here: https://bioinformatics.stackexchange.com/questions/3565/subset-smaller-bam-to-contain-several-thousand-rows-from-multiple-chromosomes


samtools view -bo POC-201-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-201-TP1.deduplicated.sorted.bam 

#count number of lines in subsetted data
samtools view POC-201-TP1.deduplicated.sorted_subset.bam | wc -l
48074

#need to create an index because it is not sorted anymore
samtools index POC-201-TP1.deduplicated.sorted_subset.bam 


# Create a pileup VCF of DNA methylation and genetic information
biscuit pileup -@ 1 -o my_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
/gscratch/scrubbed/strigg/analyses/20250506_biscuit/POC-201-TP1.deduplicated.sorted_subset.bam

# Extract DNA methylation into BED format
biscuit vcf2bed -t snp my_pileup.vcf> my_mutation_data.bed

The columns for -t snp are:

Chromosome
Start position (0-based)
End position
Reference base
Alternate base
Genotype (GT tag)
Base support (SP tag)
Coverage (AC tag)
Allele frequency (AF1 tag)

Preview of mutation data

(base) [strigg@n3439 20250506_biscuit]$ head my_mutation_data.bed 
Pocillopora_meandrina_HIv1___Sc0000000  29112   29113   A       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  33255   33256   A       N       1/1     Y1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  33365   33366   A       N       1/1     Y1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50181   50182   A       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50205   50206   T       C       1/1     C1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50348   50349   T       N       1/1     R1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  50366   50367   T       N       1/1     R1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  51334   51335   G       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  62385   62386   A       T       1/1     T1      1       1.00
Pocillopora_meandrina_HIv1___Sc0000000  101459  101460  C       T       1/1     T1      1       1.00

I also wanted to see how the function for multiple samples worked

samtools index POC-40-TP1.deduplicated.sorted_subset.bam

samtools view -bo POC-40-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-40-TP1.deduplicated.sorted.bam

#run biscuit pileup on multiple samples
biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam 

I wanted to see if the samples are still distinguishable in the output so I ran grep POC combined_pileup.vcf and that returned:

##program=<cmd=biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  POC-201-TP1.deduplicated.sorted_subsetPOC-40-TP1.deduplicated.sorted_subset

So it appears in the vcf file the data from different samples are kept seperate in different columns.

I wanted to see if this held true in the bedfile since the vcf file is not easy to work with. I ran biscuit vcf2bed -t snp combined_pileup.vcf | less and didn’t see the sample names when I scanned through the file. I tried biscuit vcf2bed -t snp combined_pileup.vcf | grep POC and this returned nothing so the bed file doesn’t not keep sample data separate.

This made me opt for writing a script to loop through each deduplicated.bam file and create a vcf and bed file for each.

Run Biscuit for multiple samples

SLURM script here: 20250507_biscuit.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20250507_biscuit
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=250G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job 
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250507_biscuit

set -ex

#activate conda env

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh" ]; then
        . "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh"
    else
        export PATH="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin:$PATH"
    fi
fi
unset __conda_setup

if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh" ]; then
    . "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh"
fi
# <<< conda initialize <<<

#define directory 

bismark_dir="/gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/"

biscuit_dir="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/"

# Create a pileup VCF of genetic information for each sample

find ${bismark_dir}*.deduplicated.sorted.bam| \
xargs basename -s .deduplicated.sorted.bam |\
xargs -I{} ${biscuit_dir}biscuit pileup -@ 24 -o {}_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
${bismark_dir}{}.deduplicated.sorted.bam

# Also compresses and indexes the VCF
bgzip -@ 24 *.vcf
tabix -p vcf *.vcf.gz


# Extract DNA methylation into BED format

find *.vcf.gz |\
xargs basename -s _pileup.vcf.gz |\
xargs -I{} ${biscuit_dir}biscuit vcf2bed -t snp {}_pileup.vcf.gz > {}_mutation_data.bed

# Also compresses and indexes the BED
bgzip *_mutation_data.bed
tabix -p bed *_mutation_data.bed.gz

Something funky is happening with the xargs command and the redirect so need to make sure the syntax is ok

I changed the syntax to a for loop and that did the trick

for fq in *.vcf.gz 
do
  # Set filename
  file_name=$(echo "${fq}" | awk -F"_" '{print $1"_mutation_data.bed"}')
  
  # Run vcf2bed
  biscuit vcf2bed \
  -t snp \
  ${fq} \
  > ${file_name}
done

RNAvar

Run nf-core rnavar pipeline

# initiate screen session
screen -S rnavar

#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00

#activate nextflow environment
mamba activate nextflow

#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results

if I am supplying a fasta and a gtf using the –fasta and –gtf parameters and not using the –genome parameter, why is the pipeline still defaulting to GRCh38?

trying this as recommended by slack

#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00

#activate nextflow environment
mamba activate nextflow

#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--genome null \
--igenomes_ignore \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results

Because the annotations aren’t working with the pipeline, perhaps I can convert the vcf file to a bed file like I did for EM-seq data with biscuit.

#!/bin/bash
## Job Name
#SBATCH --job-name=20250507_biscuit
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=250G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job 
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250506_rnavar/results/variant_calling/POC-201-TP1

set -ex

#activate conda env

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh" ]; then
        . "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh"
    else
        export PATH="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin:$PATH"
    fi
fi
unset __conda_setup

if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh" ]; then
    . "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh"
fi
# <<< conda initialize <<<



# Extract DNA methylation into BED format

for fq in *.vcf.gz 
do
  # Set filename
  file_name=$(echo "${fq}" | awk -F"_" '{print $1"_mutation_data.bed"}')
  
  # Run vcf2bed
  biscuit vcf2bed \
  -t snp \
  ${fq} \
  > ${file_name}
done

# Also compresses and indexes the BED
bgzip *_mutation_data.bed

it doesn’t like the vcf file but maybe I can reformat it. I think it’s just the header

Results

EM-seq variants
- output from subset analysis: 20250506_biscuit
RNA variants

Next steps for SNV analysis

to relate genotype to epigenotype:

for modeling, you don’t need to predefine DML or DMR unless you want to reduce comp. burden/multiple testing or you’re comparing disease vs. control and only care about sites with group-level changes

overlapping SNV and meth loci
overlapping SNV and meth region

output from chatgpt prompt: what are some ways to analyze DNA methylation data and SNV data that are in bed format?

Integration Analysis Bedtools intersect (exact or +/- 1kb) of
- DML and snps and annotations
- DMR and SNPs and annotations
Questions to ask:
- Are SNVs spatially related to methylation loci in DMRs, CpG islands, promoters, regulatory elements?
- Do SNVs disrupt CpG sites and affect methylation potential?
- Are SNVs enriched in hypo-/hypermethylated regions?
- Are co-localized SNVs and methylation changes enriched in certain biological pathways?
  - GREAT, HOMER, or ChIPseeker
Correlation Analysis: Quantify methylation at loci near SNVs
- Examine correlations between methylation levels and presence/absence of nearby SNVs.
Visualization
- IGV or UCSC Genome Browser: Load BED files to visually inspect overlaps.
- Heatmaps or Circos Plots: Summarize co-distribution genome-wide.
- Manhattan Plots: For methylation QTLs or DMRs associated with variants.
Statistical Modeling
- Regression Models
  - logistic or linear regression to test associations between SNVs and methylation levels (define mQTLs)
  - is a specific SNV associated with more or less methylation at a nearby CpG site?
    - % meth ~ SNV genotype + site + timepoint
      - tells you significance of genotype effect on meth
    - site ~ % meth * SNV + time
    - Logistic regression: Sometimes used for binary methylation (e.g., high/low).
    - Mixed-effects models: For repeated measures or hierarchical data.
- Machine Learning
  - Use methylation and SNV features to classify disease vs. control samples.
Functional analysis
- Annotations:
  - GTF features
  - Enhancers defined in other studies
  - TF binding sites
  - De novo enhancer finding software
  - ChIPseq or ATACseq data