Purpose
This post is related to GitHub issue #33.
The purpose of this entry is to align EM-seq data to the P. evermanni genome using nf-core pipelines on Klone.
Results
EM-seq
- multiqc: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250619_methylseq/multiqc/bismark/multiqc_report.html
- Methylseq output and counts matrices: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250619_methylseq/
- nf-core pipeline report: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250619_methylseq/nf_report.html
- pipeline timeline (~16 hours): https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250619_methylseq/nf_timeline.html
Methods
Create samplesheet
Correct version: samplesheet.csv
Copy genome files to Klone
# show path
pwd
/gscratch/srlab/strigg/GENOMES
# copy genome
wget https://gannet.fish.washington.edu/seashell/snaps/Porites_evermanni_v1.fa
# copy gtf file
wget https://github.com/urol-e5/timeseries_molecular/raw/99f0563a067ca9d010cb206dfd44b36d8f77de00/E-Peve/data/Porites_evermanni_validated.gtf
# copy gff file
wget https://github.com/urol-e5/timeseries_molecular/raw/99f0563a067ca9d010cb206dfd44b36d8f77de00/E-Peve/data/Porites_evermanni_validated.gff3
Copy WGBS to Klone
# reattach to the existing screen session (create it first with: screen -S methylseq)
screen -r methylseq
# make and change into dir
mkdir -p /gscratch/scrubbed/strigg/analyses/20250529_methylseq/data
cd /gscratch/scrubbed/strigg/analyses/20250529_methylseq/data
# copy data
rsync --progress --verbose --archive shellytrigg@gannet.fish.washington.edu:/volume2/web/gitrepos/urol-e5/timeseries_molecular/E-Peve/output/01.00-E-Peve-WGBS-trimming-fastp-FastQC-MultiQC .
Run methylseq pipeline
The successful 20250619_methylseq run used this .config file: uw_hyak_srlab.config
# start interactive node
salloc -A srlab -p cpu-g2-mem2x -N 1 -c 1 --mem=16GB --time=72:00:00
# activate conda environment
mamba activate nextflow
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250619_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250619_methylseq \
--fasta /gscratch/srlab/strigg/GENOMES/Porites_evermanni_v1.fa \
--em_seq \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq
Attempts that were unsuccessful
20250529_methylseq: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250529_methylseq/
This pipeline completed with errors and didn’t process all the samples, seemingly because of memory issues and because jobs were not getting requeued to the other nodes as they should. I’m in contact with UW IT and Carson Miller at UW Peds about solutions.
I attempted to rerun the pipeline with the max memory set to 400 GB to see if that might help. However, I have a hunch the jobs that failed didn’t even get resubmitted to attempt more memory in the first place.
That still failed. I am trying again with the --requeue option removed from the .config file, as suggested by Kristen from UW IT:
06/05/2025 11:05 AM PDT - Kristen Additional Comments
Hey Shelly,
Thanks for following up. We looked around in the documentation and elsewhere and we think that --requeue will requeue to the same account and partition as initially submitted. We think the --requeue flag in your second config file is in conflict with what you are trying to do. The default behavior for ckpt is that when the time limit is reached (5:05 for CPU jobs, 8:05 for GPU jobs), the job is requeued, and our slurm config for ckpt uses the --requeue flag to do this requeuing to ckpt.
We think that the NextFlow config is simply resubmitting the job and not a requeue in this sense according to Slurm’s definition of requeue. For this reason, we think you should try again without --requeue.
When you try it out and if you have more questions, please include the JobIDs and we can help investigate this behavior.
Thank you,
Kristen
06/05/2025 8:08 AM PDT - Shelly Additional Comments
Hi Kristen,
I was able to find a workaround, but I’m running into an issue now with some jobs that are running over the time limit of ckpt. In the past, these would not get requeued on ckpt and instead get queued on my lab’s node to get around the time limit. This was all specified in my config file as follows:
process {
    errorStrategy = 'retry'
    maxSubmitAwait = '60 min'
    maxRetries = 2
    executor = 'slurm'
    queue = { task.attempt < 4 ? (task.attempt < 3 ? 'ckpt' : 'cpu-g2') : 'cpu-g2-mem2x' }
    clusterOptions = { task.attempt < 4 ? (task.attempt < 3 ? "-A srlab" : "-A coenv") : "-A srlab" }
    scratch = '/gscratch/scrubbed/srlab/'
    resourceLimits = [
        cpus: 16,
        memory: '150.GB',
        time: '72.h'
    ]
}
However, when I encountered the issue with the --no-requeue failing on ckpt, I made the following modifications to my config file:
process {
    errorStrategy = { task.exitStatus in ((130..145)) ? 'ignore' : 'retry' }
    maxSubmitAwait = '60 min'
    maxRetries = 2
    executor = 'slurm'
    queue = { task.attempt < 4 ? (task.attempt < 3 ? 'ckpt-all' : 'cpu-g2') : 'cpu-g2-mem2x' }
    clusterOptions = { task.attempt < 4 ? (task.attempt < 3 ? "-A srlab --requeue" : "-A coenv --requeue") : "-A srlab --requeue" }
    scratch = '/gscratch/scrubbed/srlab/'
    resourceLimits = [
        cpus: 16,
        memory: '150.GB',
        time: '72.h'
    ]
    withName:preseq {
        errorStrategy = 'ignore'
    }
}
And now my jobs are running on ckpt but not being queued on my lab’s nodes when they fail on ckpt due to the time limit. I’m not sure how to resolve this yet so if you have any ideas, I’m all ears.
Thanks,
Shelly
In the past, leaving --requeue out of the config prevented the pipeline from running on ckpt, but maybe it will work this time?
20250605_methylseq: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250605_methylseq/
The pipeline completed with errors. Report here: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250605_methylseq/pipeline_info/nf_report2.html
Copied the data to Gannet:
rsync --progress --verbose --archive --exclude bismark/ --exclude work/ --exclude cat/ --exclude data/ 20250605_methylseq shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni
rsync --progress --verbose --archive multiqc shellytrigg@gannet.fish.washington.edu:/volume2/web/metacarcinus/E5/Pevermanni/20250605_methylseq
Certain samples (e.g. 73-TP3) still failed and didn’t get retried. This is because the config file still contained the line errorStrategy = {task.exitStatus in ((130..145)) ? 'ignore' : 'retry'}, which tells Nextflow to ignore rather than retry a task with any exit status from 130 to 145, including 143 or 145 (when the sample takes too long). I forgot to revert that line back to how I had it before those updates: errorStrategy = 'retry'.
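To see which tasks fell into that ignored range, the Nextflow trace file can be filtered by exit status. The following is a minimal sketch, assuming a tab-separated trace with a header row that includes `name`, `status`, and `exit` columns; the function name `list_ignored_range_tasks` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list tasks in a Nextflow trace file whose exit status is in 130..145.
# Assumption: tab-separated trace with a header row containing the
# columns "name", "status", and "exit".
list_ignored_range_tasks() {
  local trace_file="$1"
  awk -F'\t' '
    NR == 1 {
      # Locate columns by header name rather than by fixed position.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("name" in col) || !("status" in col) || !("exit" in col)) {
        print "trace file lacks name/status/exit columns" > "/dev/stderr"
        exit 1
      }
      next
    }
    # Keep rows whose exit status is an integer within 130..145.
    $col["exit"] ~ /^[0-9]+$/ && $col["exit"] >= 130 && $col["exit"] <= 145 {
      print $col["name"] "\t" $col["status"] "\t" $col["exit"]
    }
  ' "$trace_file"
}

# Example call (file name taken from the run directory linked in this post):
# list_ignored_range_tasks pipeline_trace.txt
```

Any task printed here would have been skipped by the 'ignore' branch of that errorStrategy rather than retried.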
I reverted the .config file back to the original parameters (uw_hyak_srlab.config), retried the pipeline, and it completed (report here). In comparing the number of samples, I realized that two pairs of samples had been concatenated because of two errors in my samplesheet. There should be 38 samples, but I only had 36 because POR-262-TP3 + TP4 and POR-73-TP3 + TP4 got concatenated. See the multiqc report.
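A quick check of the samplesheet before launching could catch this kind of mistake. The sketch below assumes the errors were rows that shared a sample name, and that the samplesheet is a comma-separated file with a header row and the sample name in the first column; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check sample names in a samplesheet.
# Assumption: comma-separated file, header row, sample name in column 1.

# Print every sample name that appears on more than one row.
find_duplicate_samples() {
  local samplesheet="$1"
  tail -n +2 "$samplesheet" | cut -d',' -f1 | sort | uniq -d
}

# Print the number of distinct sample names.
count_unique_samples() {
  local samplesheet="$1"
  tail -n +2 "$samplesheet" | cut -d',' -f1 | sort -u | awk 'END { print NR }'
}

# Example calls:
# find_duplicate_samples samplesheet.csv
# count_unique_samples samplesheet.csv
```

For this dataset the unique count should come out to 38. Repeated names are not always a mistake, since several rows per sample can be intentional when a sample was sequenced across multiple runs, so the duplicate list is something to review rather than a hard failure.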
I edited the samplesheet and am rerunning the pipeline. The previous run took 19 hours to complete, so this one should be done by tomorrow.
nextflow run nf-core/methylseq \
-c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
--input /gscratch/scrubbed/strigg/analyses/20250619_methylseq/samplesheet.csv \
--outdir /gscratch/scrubbed/strigg/analyses/20250619_methylseq \
--fasta /gscratch/srlab/strigg/GENOMES/Porites_evermanni_v1.fa \
--em_seq \
-resume \
-with-report nf_report.html \
-with-trace \
-with-timeline nf_timeline.html \
--skip_trimming \
--nomeseq
Conversation with Carson Miller:
Good morning Carson, I am wondering if you have any suggestions for getting around the --no-requeue parameter that is in the slurm script headers produced by the nf-core pipelines? It hasn’t been a problem until this month, when ckpt stopped accepting jobs with --no-requeue. I added this to the discussion here https://github.com/nextflow-io/nextflow/discussions/3573, and my idea is to comment out line 55 in the file https://github.com/nextflow-io/nextflow/blob/827ee97e7132ad64989a68f4a3e30befc7167[…]nextflow/src/main/groovy/nextflow/executor/SlurmExecutor.groovy. But that is going to require rebuilding and recompiling nextflow in the conda environment, which I’m not sure how to do. I would love any suggestions you might have.
Carson Miller May 9th at 3:01 PM A quick “hack” to work around this for the time being would be to include the following in your nextflow.config:
process {
    clusterOptions = { "--requeue" }
    errorStrategy = { task.exitStatus in ((130..145)) ? 'ignore' : 'retry' }
}
Basically this will include --requeue in the SLURM submission so that you can still use checkpoint queue. The caveat is that if a job is pre-empted, Nextflow will think that the job failed and will try to create a new job, but the SLURM scheduler will also resubmit the job (even though this won’t be tracked by Nextflow), so what you’ll end up with is duplication of jobs and a potentially very large overhead. So I added the errorStrategy line so that Nextflow will ignore pre-emptions, with the idea that I will just resume the pipeline to clean up tasks that were pre-empted.
Shelly Wanamaker May 9th at 3:17 PM Cool, hopefully that will help. Do you know how I should incorporate this code if my config file already has a process section with errorStrategy and clusterOptions?
process {
    errorStrategy = 'retry'
    maxSubmitAwait = '60 min'
    maxRetries = 2
    executor = 'slurm'
    queue = { task.attempt < 4 ? (task.attempt < 3 ? 'ckpt-all' : 'cpu-g2') : 'cpu-g2-mem2x' }
    clusterOptions = { task.attempt < 4 ? (task.attempt < 3 ? "-A srlab" : "-A coenv") : "-A srlab" }
    scratch = '/gscratch/scrubbed/srlab/'
    resourceLimits = [
        cpus: 16,
        memory: '150.GB',
        time: '72.h'
    ]
}
Carson Miller May 9th at 3:19 PM Something like this should work:
process {
    errorStrategy = { task.exitStatus in ((130..145)) ? 'ignore' : 'retry' }
    maxSubmitAwait = '60 min'
    maxRetries = 2
    executor = 'slurm'
    queue = { task.attempt < 4 ? (task.attempt < 3 ? 'ckpt-all' : 'cpu-g2') : 'cpu-g2-mem2x' }
    clusterOptions = { task.attempt < 4 ? (task.attempt < 3 ? "-A srlab --requeue" : "-A coenv --requeue") : "-A srlab --requeue" }
    scratch = '/gscratch/scrubbed/srlab/'
    resourceLimits = [
        cpus: 16,
        memory: '150.GB',
        time: '72.h'
    ]
}
Shelly Wanamaker May 9th at 3:21 PM awesome!!! thank you so much
Carson Miller May 9th at 3:46 PM I reached out to the Hyak team about this issue, so they may contact affected people and hopefully provide a fix for this soon!
Shelly Wanamaker May 9th at 3:48 PM cool thanks! I had reached out yesterday and Kristen Finch was trying to help but your fix worked!
Shelly Wanamaker Jun 5th at 11:13 AM Hi Carson, hope your summer is off to a good start! In running a recent nf-core pipeline I’ve run in the past, I noticed a number of samples were not processed despite the pipeline completing: https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250529_methylseq/nf_report.html#. I think this may have to do with the ckpt time limit because I see the samples failed at around 4 hours and the pipeline_trace file (https://gannet.fish.washington.edu/metacarcinus/E5/Pevermanni/20250529_methylseq/pipeline_trace.txt) shows they weren’t requeued to a different node, which my config file used to specify before we added those changes to get around the previous --no-requeue issue. Let me know if you have any insights into this or any suggestions. Thanks!
Shelly Wanamaker Jun 5th at 11:19 AM it could be a different problem, but in the past I would typically see the pipeline fail if samples weren’t processed
Carson Miller Jun 5th at 3:00 PM Hi Shelly, this is almost certainly due to the --no-requeue change that you mentioned. The easiest suggestion would be to use the Nextflow -resume feature and to change the submission queue to something besides checkpoint to “clean up” these ignored jobs. If you wanted to get fancier, you could add something like if (workflow.resume) { queue = "non-ckpt-queue" } to your config file so that resumed runs will always submit to something besides checkpoint!
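One way to try the simpler of Carson's two suggestions without editing the main config is a small override file passed with a second -c on the resumed run. This is only a sketch of that idea, not the conditional he describes: it assumes Nextflow merges multiple -c files with the later one taking precedence, the file name resume_override.config and the function name are hypothetical, and the default partition and account are the ones from the config quoted earlier in this post.

```shell
#!/usr/bin/env bash
# Sketch: write an override config that sends every task to a non-ckpt partition.
# Assumptions: partition "cpu-g2-mem2x" and account "srlab" come from the config
# quoted in this post; the output file name is chosen by the caller.
write_resume_override() {
  local out_file="$1"
  local queue="${2:-cpu-g2-mem2x}"
  local account="${3:-srlab}"
  cat > "$out_file" <<EOF
process {
    queue = '${queue}'
    clusterOptions = '-A ${account}'
    errorStrategy = 'retry'
}
EOF
}

# Example usage (other pipeline options omitted for brevity):
# write_resume_override resume_override.config
# nextflow run nf-core/methylseq \
#   -c /gscratch/srlab/strigg/bin/uw_hyak_srlab.config \
#   -c resume_override.config \
#   -resume
```

Keeping the override in its own file means the main config stays untouched for normal runs, and the override can be dropped once the ignored tasks have been cleaned up.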