Background
This post is related to github issue #30
The intent is to examine genetic related across the samples and to get a mild sense of what impact genetic variants might have on methylation and expression. Mild because variants are being predicted from expression and methylation data as opposed to directly being measured in WGS data.
Ideas for SNV analysis:
- RNAseq:
- EM-seq:
- Comparison of SNV analysis methods on methylation data https://pmc.ncbi.nlm.nih.gov/articles/PMC10072131/
- Biscuit: https://huishenlab.github.io/biscuit/
- BS-SNPer: https://github.com/hellbelly/BS-Snper
Methods
EM-seq
I’m going to try Biscuit based on the Sun et al. (2023) Epigenomics paper linked above.
install biscuit
conda install -c bioconda biscuit
conda install bioconda::bedtools
conda install bioconda::samtools
Run biscuit on subset of data
# initiate screen session
screen -S biscuit
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=50GB --time=8:00:00
# Create index of the reference genome (only needs to be run once for each reference)
# Gzipped FASTA references can also be used
biscuit index /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta
#create a subset of reads from one bam file
# from here: https://bioinformatics.stackexchange.com/questions/3565/subset-smaller-bam-to-contain-several-thousand-rows-from-multiple-chromosomes
samtools view -bo POC-201-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-201-TP1.deduplicated.sorted.bam
#count number of lines in subsetted data
samtools view POC-201-TP1.deduplicated.sorted_subset.bam | wc -l
48074
#need to create an index because it is not sorted anymore
samtools index POC-201-TP1.deduplicated.sorted_subset.bam
# Create a pileup VCF of DNA methylation and genetic information
biscuit pileup -@ 1 -o my_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
/gscratch/scrubbed/strigg/analyses/20250506_biscuit/POC-201-TP1.deduplicated.sorted_subset.bam
# Extract DNA methylation into BED format
biscuit vcf2bed -t snp my_pileup.vcf> my_mutation_data.bed
The columns for -t snp are:
- Chromosome
- Start position (0-based)
- End position
- Reference base
- Alternate base
- Genotype (GT tag)
- Base support (SP tag)
- Coverage (AC tag)
- Allele frequency (AF1 tag)
Preview of mutation data
(base) [strigg@n3439 20250506_biscuit]$ head my_mutation_data.bed
Pocillopora_meandrina_HIv1___Sc0000000 29112 29113 A T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 33255 33256 A N 1/1 Y1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 33365 33366 A N 1/1 Y1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50181 50182 A T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50205 50206 T C 1/1 C1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50348 50349 T N 1/1 R1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 50366 50367 T N 1/1 R1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 51334 51335 G T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 62385 62386 A T 1/1 T1 1 1.00
Pocillopora_meandrina_HIv1___Sc0000000 101459 101460 C T 1/1 T1 1 1.00
I also wanted to see how the function for multiple samples worked
samtools index POC-40-TP1.deduplicated.sorted_subset.bam
samtools view -bo POC-40-TP1.deduplicated.sorted_subset.bam -s 123.001 /gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/POC-40-TP1.deduplicated.sorted.bam
#run biscuit pileup on multiple samples
biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam
I wanted to see if the samples are still distinguishable in the output so I ran grep POC combined_pileup.vcf
and that returned:
##program=<cmd=biscuit pileup -@ 1 -o combined_pileup.vcf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta POC-201-TP1.deduplicated.sorted_subset.bam POC-40-TP1.deduplicated.sorted_subset.bam>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT POC-201-TP1.deduplicated.sorted_subsetPOC-40-TP1.deduplicated.sorted_subset
So it appears in the vcf file the data from different samples are kept seperate in different columns.
I wanted to see if this held true in the bedfile since the vcf file is not easy to work with. I ran biscuit vcf2bed -t snp combined_pileup.vcf | less
and didn’t see the sample names when I scanned through the file. I tried biscuit vcf2bed -t snp combined_pileup.vcf | grep POC
and this returned nothing so the bed file doesn’t not keep sample data separate.
This made me opt for writing a script to loop through each deduplicated.bam file and create a vcf and bed file for each.
Run Biscuit for multiple samples
SLURM script here: 20250507_biscuit.sh
#!/bin/bash
## Job Name
#SBATCH --job-name=20250507_biscuit
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=250G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250507_biscuit
set -ex
#activate conda env
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh" ]; then
. "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/conda.sh"
else
export PATH="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin:$PATH"
fi
fi
unset __conda_setup
if [ -f "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh" ]; then
. "/mmfs1/gscratch/srlab/nextflow/bin/miniforge/etc/profile.d/mamba.sh"
fi
# <<< conda initialize <<<
#define directory
bismark_dir="/gscratch/scrubbed/strigg/analyses/20250422_methylseq/bismark/deduplicated/"
biscuit_dir="/mmfs1/gscratch/srlab/nextflow/bin/miniforge/bin/"
# Create a pileup VCF of genetic information for each sample
find ${bismark_dir}*.deduplicated.sorted.bam| \
xargs basename -s .deduplicated.sorted.bam |\
xargs -I{} ${biscuit_dir}biscuit pileup -@ 24 -o {}_pileup.vcf \
/gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
${bismark_dir}{}.deduplicated.sorted.bam
# Also compresses and indexes the VCF
bgzip -@ 24 *.vcf
tabix -p vcf *.vcf.gz
# Extract DNA methylation into BED format
find *.vcf.gz |\
xargs basename -s _pileup.vcf.gz |\
xargs -I{} ${biscuit_dir}biscuit vcf2bed -t snp {}_pileup.vcf.gz > {}_mutation_data.bed
# Also compresses and indexes the BED
bgzip *_mutation_data.bed
tabix -p bed *_mutation_data.bed.gz
RNAvar
Run nf-core rnavar pipeline
# initiate screen session
screen -S rnavar
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00
#activate nextflow environment
mamba activate nextflow
#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results
if I am supplying a fasta and a gtf using the –fasta and –gtf parameters and not using the –genome parameter, why is the pipeline still defaulting to GRCh38?
trying this as recommended by slack
#start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=8:00:00
#activate nextflow environment
mamba activate nextflow
#run rnavar nf-core pipelnie
nextflow run nf-core/rnavar \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
-resume \
--input samplesheet.csv \
--genome null \
--igenomes_ignore \
--fasta /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.assembly.fasta \
--gtf /gscratch/srlab/strigg/GENOMES/Pocillopora_meandrina_HIv1.genes-validated.gtf \
--tools merge \
--skip_baserecalibration true \
--outdir results
Results
- EM-seq variants
- output from subset analysis: 20250506_biscuit
- RNA variants