Background
This post is related to github issue #30
The intent is to examine genetic related across the samples and to get a mild sense of what impact genetic variants might have on methylation and expression. Mild because variants are being predicted from expression and methylation data as opposed to directly being measured in WGS data.
Ideas for SNV analysis:
- RNAseq:
- EM-seq:
- Comparison of SNV analysis methods on methylation data https://pmc.ncbi.nlm.nih.gov/articles/PMC10072131/
- Biscuit: https://huishenlab.github.io/biscuit/
- BS-SNPer: https://github.com/hellbelly/BS-Snper
Methods
EM-seq
I’m going to try Biscuit based on the Sun et al. (2023) Epigenomics paper linked above.
install biscuit
conda install -c bioconda biscuit
conda install bioconda::bedtools
conda install bioconda::samtools
Run biscuit on subset of data
# initiate screen session
screen -S biscuit
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=50GB --time=8:00:00
# Create index of the reference genome (only needs to be run once for each reference)
# Gzipped FASTA references can also be used
biscuit index /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta
#create a subset of reads from one bam file
# from here: https://bioinformatics.stackexchange.com/questions/3565/subset-smaller-bam-to-contain-several-thousand-rows-from-multiple-chromosomes
samtools view -bo POC-201-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-201-TP1.deduplicated.sorted.bam
#count number of lines in subsetted data
samtools view POC-201-TP1.deduplicated.sorted_subset.bam | wc -l
48074
#need to create an index because it is not sorted anymore
samtools index POC-201-TP1.deduplicated.sorted_subset.bam
# Create a pileup VCF of DNA methylation and genetic information
biscuit pileup -@ 1 -o my_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
/gscratch/scrubbed/strigg/analyses/20250506_biscuit/POC-201-TP1.deduplicated.sorted_subset.bam
# Extract DNA methylation into BED format
biscuit vcf2bed -t snp my_pileup.vcf> my_mutation_data.bed
The columns for -t snp are:
- Chromosome
- Start position (0-based)
- End position
- Reference base
- Alternate base
- Genotype (GT tag)
- Base support (SP tag)
- Coverage (AC tag)
- Allele frequency (AF1 tag)
Preview of mutation data
(base) [strigg@n3439 20250506_biscuit]$ head my_mutation_data.bed
Pocillopora_meandrina_HIv1___Sc0000000 29112 29113 A T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 33255 33256 A N 1/1 Y1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 33365 33366 A N 1/1 Y1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50181 50182 A T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50205 50206 T C 1/1 C1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50348 50349 T N 1/1 R1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50366 50367 T N 1/1 R1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 51334 51335 G T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 62385 62386 A T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 101459 101460 C T 1/1 T1 1 1.00
I also wanted to see how the function for multiple samples worked
samtools index POC-40-TP1.deduplicated.sorted_subset.bam
samtools view -bo POC-40-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-40-TP1.deduplicated.sorted.bam
#run biscuit pileup on multiple samples
biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam
I wanted to see if the samples are still distinguishable in the output so I ran grep POC combined_pileup.vcf
and that returned:
##program=<cmd=biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT POC-201-TP1.deduplicated.sorted_subsetPOC-40-TP1.deduplicated.sorted_subset
So it appears in the vcf file the data from different samples are kept seperate in different columns.
I wanted to see if this held true in the bedfile since the vcf file is not easy to work with. I ran biscuit vcf2bed -t snp combined_pileup.vcf | less
and didn’t see the sample names when I scanned through the file. I tried biscuit vcf2bed -t snp combined_pileup.vcf | grep POC
and this returned nothing so the bed file doesn’t not keep sample data separate.
This made me opt for writing a script to loop through each deduplicated.bam file and create a vcf and bed file for each.
Run Biscuit for multiple samples
SLURM script here: 20250507_biscuit.sh
#!/bin/bash
## Job Name
#SBATCH --job-name=20250507_biscuit
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=250G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250507_biscuit
set -ex
#activate conda env
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh" ]; then
. "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh"
else
export PATH="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin:$PATH"
fi
fi
unset __conda_setup
if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh" ]; then
. "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh"
fi
# <<< conda initialize <<<
#define directory
bismark_dir="/gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/"
biscuit_dir="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/"
# Create a pileup VCF of genetic information for each sample
find ${bismark_dir}*.deduplicated.sorted.bam| \
xargs basename -s .deduplicated.sorted.bam |\
xargs -I{} ${biscuit_dir}biscuit pileup -@ 24 -o {}_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
${bismark_dir}{}.deduplicated.sorted.bam
# Also compresses and indexes the VCF
bgzip -@ 24 *.vcf
tabix -p vcf *.vcf.gz
# Extract DNA methylation into BED format
find *.vcf.gz |\
xargs basename -s _pileup.vcf.gz |\
xargs -I{} ${biscuit_dir}biscuit vcf2bed -t snp {}_pileup.vcf.gz > {}_mutation_data.bed
# Also compresses and indexes the BED
bgzip *_mutation_data.bed
tabix -p bed *_mutation_data.bed.gz
Something funky is happening with the xargs command and the redirect so need to make sure the syntax is ok
I changed the syntax to a for loop and that did the trick
for fq in *.vcf.gz
do
# Set filename
file_name=$(echo "${fq}" | awk -F"_" '{print $1"_mutation_data.bed"}')
# Run vcf2bed
biscuit vcf2bed \
-t snp \
${fq} \
> ${file_name}
done
RNAvar
Run nf-core rnavar pipeline
# initiate screen session
screen -S rnavar
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00
#activate nextflow environment
mamba activate nextflow
#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results
if I am supplying a fasta and a gtf using the –fasta and –gtf parameters and not using the –genome parameter, why is the pipeline still defaulting to GRCh38?
trying this as recommended by slack
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00
#activate nextflow environment
mamba activate nextflow
#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--genome null \
--igenomes_ignore \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results
Because the annotations aren’t working with the pipeline, perhaps I can convert the vcf file to a bed file like I did for EM-seq data with biscuit.
#!/bin/bash
## Job Name
#SBATCH --job-name=20250507_biscuit
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=250G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250506_rnavar/results/variant_calling/POC-201-TP1
set -ex
#activate conda env
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh" ]; then
. "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh"
else
export PATH="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin:$PATH"
fi
fi
unset __conda_setup
if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh" ]; then
. "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh"
fi
# <<< conda initialize <<<
# Extract DNA methylation into BED format
for fq in *.vcf.gz
do
# Set filename
file_name=$(echo "${fq}" | awk -F"_" '{print $1"_mutation_data.bed"}')
# Run vcf2bed
biscuit vcf2bed \
-t snp \
${fq} \
> ${file_name}
done
# Also compresses and indexes the BED
bgzip *_mutation_data.bed
it doesn’t like the vcf file but maybe I can reformat it. I think it’s just the header
Results
- EM-seq variants
- output from subset analysis: 20250506_biscuit
- RNA variants
Next steps for SNV analysis
to relate genotype to epigenotype:
for modeling, you don’t need to predefine DML or DMR unless you want to reduce comp. burden/multiple testing or you’re comparing disease vs. control and only care about sites with group-level changes
- overlapping SNV and meth loci
- overlapping SNV and meth region
output from chatgpt prompt: what are some ways to analyze DNA methylation data and SNV data that are in bed format?
- Integration Analysis
Bedtools intersect (exact or +/- 1kb) of
- DML and snps and annotations
- DMR and SNPs and annotations
Questions to ask:
- Are SNVs spatially related to methylation loci in DMRs, CpG islands, promoters, regulatory elements?
- Do SNVs disrupt CpG sites and affect methylation potential?
- Are SNVs enriched in hypo-/hypermethylated regions?
- Are co-localized SNVs and methylation changes enriched in certain biological pathways?
- GREAT, HOMER, or ChIPseeker
-
Correlation Analysis: Quantify methylation at loci near SNVs
- Examine correlations between methylation levels and presence/absence of nearby SNVs.
-
Visualization
-
IGV or UCSC Genome Browser: Load BED files to visually inspect overlaps.
-
Heatmaps or Circos Plots: Summarize co-distribution genome-wide.
-
Manhattan Plots: For methylation QTLs or DMRs associated with variants.
-
-
Statistical Modeling
- Regression Models
- logistic or linear regression to test associations between SNVs and methylation levels (define mQTLs)
- is a specific SNV associated with more or less methylation at a nearby CpG site?
- % meth ~ SNV genotype + site + timepoint
- tells you significance of genotype effect on meth
- site ~ % meth * SNV + time
- Logistic regression: Sometimes used for binary methylation (e.g., high/low).
- Mixed-effects models: For repeated measures or hierarchical data.
- % meth ~ SNV genotype + site + timepoint
- Machine Learning
- Use methylation and SNV features to classify disease vs. control samples.
- Regression Models
- Functional analysis
- Annotations:
- GTF features
- Enhancers defined in other studies
- TF binding sites
- De novo enhancer finding software
- ChIPseq or ATACseq data
- Annotations: