Purpose

This post is about completing the preliminary WGBS analysis and takes the .cov files from the bismark output, filters for 10X coverage, and merges them into a matrix.

I followed SR’s code here for making 10X bedgraphs and _processed.txt files (which have two columns: one for CpG loci and one for % methylation with the header ‘CpG’ and ‘'). For merging the `_processed.txt` files, I followed SR's code here: [https://github.com/urol-e5/timeseries_molecular/blob/main/D-Apul/code/22.5-Apul-multiomic-machine-learning-sr.Rmd](https://github.com/urol-e5/timeseries_molecular/blob/main/D-Apul/code/22.5-Apul-multiomic-machine-learning-sr.Rmd).

Results

Methods

I first copied data to klone from gannet

rsync --progress --verbose --archive shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni/20250619_methylseq/bismark/methylation_calls/methylation_coverage/*.cov.gz  /gscratch/scrubbed/strigg/analyses/20250728_meth_Peve

rsync --progress --verbose --archive shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/methylation_calls/methylation_coverage/*.cov.gz  /gscratch/scrubbed/strigg/analyses/20250728_meth_Ptua

Then I ran a slurm script that creates the bedgraphs and processed.txt files, and calls a python script that does the merging.

Here’s the full code for Peve as an example

#!/bin/bash
## Job Name
#SBATCH --job-name=20250728_mergeCov
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=cpu-g2
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=300G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job 
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250728_meth_Peve

%%bash

set -ex


# make bed file from cov file keeping only CpGs w. 10x cov
for f in *.fastp-trim_bismark_bt2_pe.deduplicated.bismark.cov.gz
do
  STEM=$(basename "${f}") # Get the entire filename including the long suffix
  STEM="${STEM%.POR-*.fastp-trim_bismark_bt2_pe.deduplicated.bismark.cov.gz}" # Remove the suffix using parameter expansion
  zcat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
  > "${STEM}"_10x.bedgraph
done

# create modified tables with two columns; one is the CpG ID which is merged chrom and start site; one is the %meth 
for file in *10x.bedgraph; do
    awk -F"\t" -v fname="${file%_10x*}" 'BEGIN {print "CpG\t" fname}{print "CpG_"_$1"_"$2"\t"$4}' "$file" > "${file%.bedgraph}_processed.txt"
done

python /gscratch/srlab/strigg/scripts/merge_processed_txt.py

Here’s the python script (also here: merge_processed_txt.py)

import pandas as pd
import glob

# Define the directory and file pattern
file_pattern = '*_processed.txt'  # Adjust for your file type and location

# Get a list of all matching file paths
file_paths = glob.glob(file_pattern)

# Read each file into a DataFrame and store them in a list
list_of_dfs = [pd.read_csv(file_path, sep ='\t') for file_path in file_paths]

# merge dfs together
if list_of_dfs:
    merged_data = list_of_dfs[0]
    for i in range(1, len(list_of_dfs)):
        merged_data = pd.merge(merged_data, list_of_dfs[i], on="CpG", how="outer")
else:
    merged_data = pd.DataFrame() # Or handle the empty list case as needed


# Save the merged DataFrame to a CSV file
if not merged_data.empty: # Only save if the DataFrame is not empty
    merged_data.to_csv('merged-WGBS-CpG-counts.csv', index=False)

# Remove rows with NA values
if not merged_data.empty:
    filtered_data = merged_data.dropna()  # This removes any row with at least one NA value

# Save the filtered DataFrame to a new CSV file
filtered_data.to_csv('merged-WGBS-CpG-counts_filtered.csv', index=False)

Then I copied the results to gannet

rsync --progress --verbose --archive --exclude '*.gz' 20250728_meth_Peve shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni

rsync --progress --verbose --archive --exclude '*.gz' 20250728_meth_Ptua shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Ptuahiniensis