Mon. Jul. 22, 2019 Salmon sea lice methylomes and Geoduck genome

run fastqc and trim

I ended up getting this to run on Ostrich via this jupyter notebook:
- 20190722_SalmonSeaLice_FASTQC_TRIM.ipynb

The multiqc instance I installed on Ostrich via didn’t seem to work properly (errors in jupyter notebook output) so I ran this on emu and it worked

  scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.zip srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
  scp /Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722/*.html srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ multiqc .
  [INFO   ]         multiqc : This is MultiQC v0.9
  [INFO   ]         multiqc : Template    : default
  [INFO   ]         multiqc : Searching '.'
  [INFO   ]          fastqc : Found 92 reports
  [INFO   ]         multiqc : Report      : multiqc_report.html
  [INFO   ]         multiqc : Data        : multiqc_data
  [INFO   ]         multiqc : MultiQC complete
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rsync --archive --progress --verbose multiqc_data strigg@ostrich.fish.washington.edu:/Volumes/web/metacarcinus/Salmo_Calig/analyses/20190722
  Password:
  building file list ... 
  6 files to consider
  multiqc_data/
  multiqc_data/.multiqc.log
            6,880 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=4/6)
  multiqc_data/multiqc_fastqc.txt
           20,380 100%   19.44MB/s    0:00:00 (xfr#2, to-chk=3/6)
  multiqc_data/multiqc_general_stats.txt
            6,880 100%    6.56MB/s    0:00:00 (xfr#3, to-chk=2/6)
  multiqc_data/multiqc_report.html
        2,567,171 100%   18.14MB/s    0:00:00 (xfr#4, to-chk=1/6)
  multiqc_data/multiqc_sources.txt
           12,078 100%   87.37kB/s    0:00:00 (xfr#5, to-chk=0/6)
	
  sent 2,614,141 bytes  received 500 bytes  747,040.29 bytes/sec
  total size is 2,613,389  speedup is 1.00
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.html
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm *.zip
  srlab@emu:~/GitHub/Shelly_Pgenerosa/multiqc$ rm -r multiqc_data/

Raw sequence FASTQC output folder:
- https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20190722/
Raw sequence MultiQC report (HTML):
- https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20190722/multiqc_data/multiqc_report.html
- CONCLUSIONS:
  - TOTAL READS:
    - Attempted 1600M reads, ended up with 1034M reads (65%) going to samples
    - flowcell was under-loaded (which Dolores said it would likely be since overloading can ruin all the data).
  - SAMPLE DEPTH:
    - Salmon: ranges from 25M - 44M, median and mean around 32M
    - Sea Lice: 226M for Female 1 and 163M for Female 2
TrimGalore! output folder (adapter trimmed FastQ files):
- https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20190722/TRIM_adapt/
Adapter trimming MultiQC report (HTML):
- https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20190722/TRIM_adapt/FASTQC/multiqc_data/multiqc_report.html
TrimGalore hard trim output folder (first 5bp trimmed from 5’ of each read):
- https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20190722/TRIM_1st5bp/
Hard trim MultiQC report (HTML):

copy the salmon genome and sea lice genome to mox

Steven shared the C_rogercresseyi on this CaligusLIFE Slack channel thread.

I downloaded it locally and then copied it to Mox:

  Shellys-MacBook-Pro:coverage Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Caligus/GENOMES
  Password: 
  Enter passcode or select one of the following options:
   1. Duo Push to Android (XXX-XXX-0029)
   2. Phone call to Android (XXX-XXX-0029)
  Duo passcode or option [1-2]: 1
  building file list ... 
  1 file to consider
  Caligus_rogercresseyi_Genome.fa
     515420160 100%   19.75MB/s    0:00:24 (xfer#1, to-check=0/1)
  sent 515483225 bytes  received 42 bytes  14122829.23 bytes/sec
  total size is 515420160  speedup is 1.00

I also copied it to Gannet

  Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose ~/Downloads/Caligus_rogercresseyi_Genome.fa /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/Caligus
  building file list ... 
  1 file to consider
  Caligus_rogercresseyi_Genome.fa
     515420160 100%   12.72MB/s    0:00:38 (xfer#1, to-check=0/1)
		
  sent 515483225 bytes  received 42 bytes  13050209.29 bytes/sec
  total size is 515420160  speedup is 1.00

I downloaded the mgannetost recent RefSeq version of the S.salar genome on gannet:

      ostrich:RefSeq strigg$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
  --2019-07-19 12:21:17--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz
             => ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’
  Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.11, 2607:f220:41e:250::11
  Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.11|:21... connected.
  Logging in as anonymous ... Logged in!
  ==> SYST ... done.    ==> PWD ... done.
  ==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/000/233/375/GCF_000233375.1_ICSASG_v2 ... done.
  ==> SIZE GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... 759073402
  ==> PASV ... done.    ==> RETR GCF_000233375.1_ICSASG_v2_genomic.fna.gz ... done.
  Length: 759073402 (724M) (unauthoritative)
	
  GCF_000233375.1_ICSASG_v2_genomic.fna.gz                            100%[=================================================================================================================================================================>] 723.91M   631KB/s    in 19m 53s 
	
  2019-07-19 12:41:12 (621 KB/s) - ‘GCF_000233375.1_ICSASG_v2_genomic.fna.gz’ saved [759073402] 

then copied the S. salar RefSeq genome to Mox:

  Shellys-MacBook-Pro:Caligus Shelly$ rsync --archive --progress --verbose /Volumes/web/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq/GCF_000233375.1_ICSASG_v2_genomic.fna.gz strigg@mox.hyak.uw.edu:/gscratch/srlab/strigg/data/Ssalar/GENOMES
  Password: 
  Enter passcode or select one of the following options:
		
   1. Duo Push to Android (XXX-XXX-0029)
   2. Phone call to Android (XXX-XXX-0029)
		
  Duo passcode or option [1-2]: 1
  building file list ... 
  1 file to consider
  GCF_000233375.1_ICSASG_v2_genomic.fna.gz
     759073402 100%    7.68MB/s    0:01:34 (xfer#1, to-check=0/1)
		
  sent 759166220 bytes  received 42 bytes  7195888.74 bytes/sec
  total size is 759073402  speedup is 1.00

copy data from gannet to owl

Sam is helping with this. See github issue #713
I still need to add the URLs for the FASTQC files

NEXT STEPS:

Concatenate data from different lanes together (L001 + L002 for each sample)
transfer concatenated trimmed reads from gannet to Mox
determine alignment parameters for:
- subset of Calig. data
- subset of Salmon data
run bismark on all data

Geoduck genome call notes:

method: assembly = 10X + phase overlay and illumina data used for QC

genome versions:

v70
- MAKER ids 53000 transcripts
- BUSCO: 84% complete, 78% single copy, 5% dup; 2Gb, N50 = 6Mbp
v074 (18scaffolds; not using BLAST to call genes use the gff maker generates)
- MAKER: 1700 transcripts; instead map raw transcriptome reads to assembly and see if they map to more than 1700 transcriptome
- BUSCO: 71% complete, 70% single copy, 0.9% dup; 1.42Gb (942Mbp after QC), N50 = 58Mbp
- Steven’s alignments of subset of data with score min -0.6 had ~40% mapping
- maybe take Steven’s data and do the same analysis to compare coverage with my previous analysis of v70, v71 and v73; or do the scaffold coverage analysis I previously did but on v074. Then see how things map

Next steps:

figure out what’s going on with largest scaffold: do transcripts map to it? yes
- Use transcriptome as positive control: tissue specific should give estimate of # of transcriptome (Sam’s doing this analysis)
pick samples to run against 74, and find # of loci with min cov 10
- DML (Hollie)
- DMR 1000bp window (shelly)
- Gene Body analysis
Determine if we should deduplicate or not
Do we need PacBio or Nanopore to validate? shouldn’t need validation with all the data we have

Lab meeting notes

Yaamini mentioned GO MWU that does rank based GO enrichment (does MW test instead of fisher)

Oyster proteomics paper: make corrections and don’t overclaim, then resend to Steven

Geoduck broodstock pH exp.: start writing manuscript