run fastqc and trim
- I ended up getting this to run on Ostrich via this jupyter notebook:
- The MultiQC instance I installed on Ostrich didn’t seem to work properly (the Jupyter notebook output showed errors), so I ran MultiQC on Emu instead, where it worked:
  scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.zip srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
  scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.html srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ multiqc .
  [INFO ] multiqc : This is MultiQC v0.9
  [INFO ] multiqc : Template : default
  [INFO ] multiqc : Searching '.'
  [INFO ] fastqc : Found 92 reports
  [INFO ] multiqc : Report : multiqc_report.html
  [INFO ] multiqc : Data : multiqc_data
  [INFO ] multiqc : MultiQC complete
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rsync --archive --progress --verbose multiqc_data strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722
  Password:
  building file list ... 6 files to consider
  multiqc_data/
  multiqc_data/.multiqc.log
  6,880 100% 0.00kB/s 0:00:00 (xfr#1, to-chk=4/6)
  multiqc_data/multiqc_fastqc.txt
  20,380 100% 19.44MB/s 0:00:00 (xfr#2, to-chk=3/6)
  multiqc_data/multiqc_general_stats.txt
  6,880 100% 6.56MB/s 0:00:00 (xfr#3, to-chk=2/6)
  multiqc_data/multiqc_report.html
  2,567,171 100% 18.14MB/s 0:00:00 (xfr#4, to-chk=1/6)
  multiqc_data/multiqc_sources.txt
  12,078 100% 87.37kB/s 0:00:00 (xfr#5, to-chk=0/6)
  sent 2,614,141 bytes received 500 bytes 747,040.29 bytes/sec
  total size is 2,613,389 speedup is 1.00
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.html
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.zip
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm -r multiqc_data/
- Raw sequence FASTQC output folder:
- Raw sequence MultiQC report (HTML):
-
CONCLUSIONS:
- TOTAL READS:
  - Attempted 1600M reads, ended up with 1034M reads (65%) going to samples
  - Flowcell was under-loaded (which Dolores said it would likely be, since overloading can ruin all the data).
- SAMPLE DEPTH:
  - Salmon: ranges from 25M - 44M, median and mean around 32M
  - Sea Lice: 226M for Female 1 and 163M for Female 2
- TrimGalore! output folder (adapter trimmed FastQ files):
- Adapter trimming MultiQC report (HTML):
- TrimGalore hard trim output folder (first 5bp trimmed from the 5’ end of each read):
- Hard trim MultiQC report (HTML):
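The read-yield percentage quoted in the conclusions above can be double-checked with a one-line calculation. This is only a minimal sketch: the helper name `pct_yield` is hypothetical, and the inputs are the read counts (in millions) reported above.

```shell
# Hypothetical helper: percent of attempted reads that ended up assigned to samples.
pct_yield() {
  awk -v got="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * got / total }'
}

pct_yield 1034 1600   # prints 64.6, which rounds to the ~65% noted above
```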
copy the salmon genome and sea lice genome to mox
- Steven shared the C. rogercresseyi genome in this CaligusLIFE Slack channel thread.
- I downloaded it locally and then copied it to Mox:
  Shellys-MacBook-Pro:coverage Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Caligus/GENOMES
  Password:
  Enter passcode or select one of the following options:
  1. Duo Push to Android (XXX-XXX-0029)
  2. Phone call to Android (XXX-XXX-0029)
  Duo passcode or option [1-2]: 1
  building file list ... 1 file to consider
  Caligus_rogercresseyi_Genome.fa
  515420160 100% 19.75MB/s 0:00:24 (xfer#1, to-check=0/1)
  sent 515483225 bytes received 42 bytes 14122829.23 bytes/sec
  total size is 515420160 speedup is 1.00
- I also copied it to Gannet:
  Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/Caligus
  building file list ... 1 file to consider
  Caligus_rogercresseyi_Genome.fa
  515420160 100% 12.72MB/s 0:00:38 (xfer#1, to-check=0/1)
  sent 515483225 bytes received 42 bytes 13050209.29 bytes/sec
  total size is 515420160 speedup is 1.00
-
- I downloaded the most recent RefSeq version of the S. salar genome on Gannet:
  ostrich:RefSeq strigg$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
  --2019-07-19 12:21:17-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
  => ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’
  Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::11
  Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
  Logging in as anonymous ... Logged in!
  ==> SYST ... done. ==> PWD ... done.
  ==> TYPE I ... done. ==> CWD (1) /genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2 ... done.
  ==> SIZE GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... 759073402
  ==> PASV ... done. ==> RETR GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... done.
  Length: 759073402 (724M) (unauthoritative)
  GCF_000233375.1_ICSASG_v2_genomic.fna.gz 100%[=================================================================================================================================================================>] 723.91M 631KB/s in 19m 53s
  2019-07-19 12:41:12 (621 KB/s) - ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’ saved [759073402]
- Then I copied the S. salar RefSeq genome to Mox:
  Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq/GCF_000233375.1_ICSASG_v2_genomic.fna.gz strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Ssalar/GENOMES
  Password:
  Enter passcode or select one of the following options:
  1. Duo Push to Android (XXX-XXX-0029)
  2. Phone call to Android (XXX-XXX-0029)
  Duo passcode or option [1-2]: 1
  building file list ... 1 file to consider
  GCF_000233375.1_ICSASG_v2_genomic.fna.gz
  759073402 100% 7.68MB/s 0:01:34 (xfer#1, to-check=0/1)
  sent 759166220 bytes received 42 bytes 7195888.74 bytes/sec
  total size is 759073402 speedup is 1.00
-
copy data from gannet to owl
- Sam is helping with this. See GitHub issue #713.
- I still need to add the URLs for the FASTQC files.
NEXT STEPS:
- Concatenate data from different lanes together (L001 + L002 for each sample)
- Transfer concatenated trimmed reads from gannet to Mox
- Determine alignment parameters for:
  - subset of Calig. data
  - subset of Salmon data
- Run bismark on all data
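The lane concatenation step above could be sketched roughly as follows. This is not the script that was actually run: the function name `concat_lanes` is hypothetical, and the file-name pattern (`_L001_` / `_L002_` in otherwise identical names ending in `.fq.gz`) is an assumption that should be checked against the real TrimGalore output names.

```shell
# Sketch: merge lane 1 and lane 2 FastQ files for each sample.
# ASSUMPTION: lanes are marked by _L001_ / _L002_ in otherwise identical
# file names ending in .fq.gz; adjust the pattern to the real file names.
concat_lanes() {
  in_dir="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  for l1 in "$in_dir"/*_L001_*.fq.gz; do
    [ -e "$l1" ] || continue   # skip the unexpanded pattern if nothing matches
    base="$(basename "$l1")"
    l2="$in_dir/$(printf '%s' "$base" | sed 's/_L001_/_L002_/')"
    merged="$out_dir/$(printf '%s' "$base" | sed 's/_L001_/_/')"
    if [ ! -e "$l2" ]; then
      echo "no lane 2 file for $base, skipping" >&2
      continue
    fi
    # Concatenated gzip files form a valid multi-member gzip stream,
    # so no decompression is needed.
    cat "$l1" "$l2" > "$merged"
  done
}
```

It would be called with an input and an output directory, for example `concat_lanes trimmed merged` (both directory names are placeholders). For paired-end data the R1 and R2 files are handled independently, so read order stays matched as long as both lanes are concatenated in the same order.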
Geoduck genome call notes:
Method: assembly = 10X + phase overlay; Illumina data used for QC
genome versions:
- v70
  - MAKER identifies 53000 transcripts
  - BUSCO: 84% complete, 78% single copy, 5% dup; 2Gb, N50 = 6Mbp
- v074 (18 scaffolds; not using BLAST to call genes, use the GFF that MAKER generates)
  - MAKER: 1700 transcripts; instead, map raw transcriptome reads to the assembly and see if they map to more than 1700 transcripts
  - BUSCO: 71% complete, 70% single copy, 0.9% dup; 1.42Gb (942Mbp after QC), N50 = 58Mbp
- Steven’s alignments of a subset of the data with score min -0.6 had ~40% mapping
- Maybe take Steven’s data and do the same analysis to compare coverage with my previous analysis of v70, v71 and v73; or redo the scaffold coverage analysis I previously did, but on v074. Then see how things map
Next steps:
- Figure out what’s going on with the largest scaffold: do transcripts map to it? Yes
- Use the transcriptome as a positive control: tissue-specific data should give an estimate of the # of transcripts (Sam’s doing this analysis)
- pick samples to run against 74, and find # of loci with min cov 10
- DML (Hollie)
- DMR 1000bp window (Shelly)
- Gene Body analysis
- Determine if we should deduplicate or not
- Do we need PacBio or Nanopore to validate? Shouldn’t need validation with all the data we have
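For the "# of loci with min cov 10" step above, one possible way to count loci is sketched below. It assumes the input is a Bismark coverage file (tab-separated: chromosome, start, end, % methylation, count methylated, count unmethylated); the function name `count_loci_min_cov` and any file names are hypothetical.

```shell
# Count loci covered by at least min_cov reads (default 10).
# ASSUMPTION: input is a Bismark coverage file, where column 5 is the
# methylated count and column 6 the unmethylated count; plain or gzipped.
count_loci_min_cov() {
  cov_file="$1"
  min_cov="${2:-10}"
  case "$cov_file" in
    *.gz) gzip -dc "$cov_file" ;;
    *)    cat "$cov_file" ;;
  esac | awk -v min="$min_cov" '($5 + $6) >= min { n++ } END { print n + 0 }'
}
```

A call such as `count_loci_min_cov sample.bismark.cov.gz 10` (placeholder file name) prints a single number per sample, which makes it easy to compare samples aligned against v074.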
Lab meeting notes
Yaamini mentioned GO_MWU, which does rank-based GO enrichment (it uses a Mann-Whitney U test instead of Fisher’s exact test)
Oyster proteomics paper: make corrections and don’t overclaim, then resend to Steven
Geoduck broodstock pH exp.: start writing manuscript