Create 10x CpG matrix for Peve and Ptua WGBS

Purpose

This post is about completing the preliminary WGBS analysis and takes the .cov files from the bismark output, filters for 10X coverage, and merges them into a matrix.

I followed SR’s code here for making 10X bedgraphs and _processed.txt files (which have two columns: one for CpG loci and one for % methylation with the header ‘CpG’ and ‘'). For merging the `_processed.txt` files, I followed SR's code here: [https://github.com/urol-e5/timeseries_molecular/blob/main/D-Apul/code/22.5-Apul-multiomic-machine-learning-sr.Rmd](https://github.com/urol-e5/timeseries_molecular/blob/main/D-Apul/code/22.5-Apul-multiomic-machine-learning-sr.Rmd).

Results

Peve WGBS matrix merged-WGBS-CpG-counts.csv
Peve WGBS matrix filtered for sites with 10x coverage in at least 20 samples merged-WGBS-CpG-counts_filtered_n20.csv
Peve WGBS matrix filtered for sites with 10x coverage in all samples merged-WGBS-CpG-counts_filtered.csv
Ptua WGBS matrix merged-WGBS-CpG-counts.csv
Ptua WGBS matrix for sites with 10x coverage in at least 20 samples merged-WGBS-CpG-counts_filtered_n20.csv
Ptua WGBS matrix filtered for sites with 10x coverage in all samples merged-WGBS-CpG-counts_filtered.csv
Peve script output: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250821_meth_Peve/
Ptua script output: https://gannet.fish.washington.edu/metacarcinus/E5/Ptuahiniensis/20250821_meth_Ptua/

Methods

I first copied data to klone from gannet

rsync --archive --verbose --progress shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni/20250619_methylseq/bismark/coverage2cytosine/coverage/ /gscratch/scrubbed/strigg/analyses/20250821_meth_Peve

rsync --archive --verbose --progress shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/coverage2cytosine/coverage/  /gscratch/scrubbed/strigg/analyses/20250821_meth_Ptua

Then I ran a slurm script that creates the bedgraphs and processed.txt files, and calls a python script that does the merging.

Peve: 20250821_mergeCov_Peve.sh
Ptua: 20250821_mergeCov_Ptua.sh

Here’s the full code for Peve as an example

#!/bin/bash
## Job Name
#SBATCH --job-name=20250821_mergeCov
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=cpu-g2
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=1-00:00:00
## Memory per node
#SBATCH --mem=300G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=strigg@uw.edu
## Specify the working directory for this job 
#SBATCH --chdir=/gscratch/scrubbed/strigg/analyses/20250821_meth_Peve

%%bash

set -ex


# make bed file from cov file keeping only CpGs w. 10x cov
for f in *.CpG_report.merged_CpG_evidence.cov.gz
do
  STEM=$(basename "${f}") # Get the entire filename including the long suffix
  STEM="${STEM%*.CpG_report.merged_CpG_evidence.cov.gz}" # Remove the suffix using parameter expansion
  zcat "${f}" | awk -F $'\t' 'BEGIN {OFS = FS} {if ($5+$6 >= 10) {print $1, $2, $3, $4}}' \
  > "${STEM}"_10x.bedgraph
done

# create modified tables with two columns; one is the CpG ID which is merged chrom and start site; one is the %meth 
for file in *10x.bedgraph; do
    awk -F"\t" -v fname="${file%_10x*}" 'BEGIN {print "CpG\t" fname}{print "CpG_"_$1"_"$2"\t"$4}' "$file" > "${file%.bedgraph}_processed.txt"
done

python /gscratch/srlab/strigg/scripts/merge_processed_txt.py

Here’s the python script (also here: merge_processed_txt.py)

import pandas as pd
import glob

# Define the directory and file pattern
file_pattern = '*_processed.txt'  # Adjust for your file type and location

# Get a list of all matching file paths
file_paths = glob.glob(file_pattern)

# Read each file into a DataFrame and store them in a list
list_of_dfs = [pd.read_csv(file_path, sep ='\t') for file_path in file_paths]

# merge dfs together
if list_of_dfs:
    merged_data = list_of_dfs[0]
    for i in range(1, len(list_of_dfs)):
        merged_data = pd.merge(merged_data, list_of_dfs[i], on="CpG", how="outer")
else:
    merged_data = pd.DataFrame() # Or handle the empty list case as needed


# Save the merged DataFrame to a CSV file
if not merged_data.empty: # Only save if the DataFrame is not empty
    merged_data.to_csv('merged-WGBS-CpG-counts.csv', index=False)

# Remove rows with NA values
if not merged_data.empty:
    filtered_data = merged_data.dropna()  # This removes any row with at least one NA value

# Save the filtered DataFrame to a new CSV file
filtered_data.to_csv('merged-WGBS-CpG-counts_filtered.csv', index=False)

To filter for CpGs in at least 20 samples, I ran this awk code:

awk -F',' 'NF >= 21 && $1 != "" {count = 0;for (i = 2; i <= NF; i++) {if ($i != "") {count++;}}if (count >= 20) {print;}}' merged-WGBS-CpG-counts.csv > merged-WGBS-CpG-counts_filtered_n20.csv

Then I copied the results to gannet

sync --progress --verbose --archive --exclude *.gz 20250821_meth_Ptua shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Ptuahiniensis

rsync --progress --verbose --archive --exclude *.gz 20250821_meth_Peve shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni

Failed attempts

In my initial attempt at this, I mistakenly used the .cov.gz files where neighboring CpGs were not merged

get data from gannet

rsync --progress --verbose --archive shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni/20250619_methylseq/bismark/methylation_calls/methylation_coverage/*.cov.gz  /gscratch/scrubbed/strigg/analyses/20250728_meth_Peve

rsync --progress --verbose --archive shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Ptuahiniensis/20250422_methylseq/bismark/methylation_calls/methylation_coverage/*.cov.gz  /gscratch/scrubbed/strigg/analyses/20250728_meth_Ptua

scripts to make the CpG matrix

Peve: 20250728_mergeCov.sh
Ptua: 20250728_mergeCov_Ptua.sh

copy data back to gannet

rsync --progress --verbose --archive --exclude '*.gz' 20250728_meth_Peve shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni

rsync --progress --verbose --archive --exclude '*.gz' 20250728_meth_Ptua shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Ptuahiniensis