Tues. Apr 3-7, 2020 Geoduck DMR genome features

Genomic Feature Analysis

Pipeline overview:

Make master genome feature bed file
- download feature files (.gff and beds) from OSF
- concatenate files
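
A minimal sketch of the concatenation step, assuming the downloaded GFF files use standard tab-separated columns and that the bed files already carry a feature label in their fourth column (both are assumptions; the actual OSF files may differ). The function names and toy records are hypothetical. The main point is the coordinate shift: GFF is 1-based and inclusive, while BED is 0-based and half-open.

```python
def gff_line_to_bed(line):
    """Convert one GFF line to a (chrom, start, end, feature_type) BED-style tuple."""
    fields = line.rstrip("\n").split("\t")
    chrom, feature_type = fields[0], fields[2]
    # GFF coordinates are 1-based and inclusive; BED is 0-based and half-open,
    # so only the start shifts by one.
    start, end = int(fields[3]) - 1, int(fields[4])
    return (chrom, start, end, feature_type)


def build_master_bed(gff_lines, bed_records):
    """Merge GFF-derived and BED-derived records into one coordinate-sorted list."""
    records = [
        gff_line_to_bed(line)
        for line in gff_lines
        if line.strip() and not line.startswith("#")  # skip blanks and header lines
    ]
    records.extend(bed_records)
    return sorted(records, key=lambda r: (r[0], r[1], r[2]))


# Toy input (hypothetical scaffold and feature names, not real data)
gff = [
    "##gff-version 3",
    "scaffold_1\tsrc\tgene\t101\t500\t.\t+\t.\tID=g1",
    "scaffold_1\tsrc\texon\t101\t250\t.\t+\t.\tParent=g1",
]
beds = [("scaffold_1", 600, 900, "intergenic")]

master = build_master_bed(gff, beds)
for rec in master:
    print("\t".join(map(str, rec)))
```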
Use bedtools intersect to match DMRs to features in master genome feature file
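
To illustrate what the intersect step reports, here is a pure-Python sketch of the overlap test (it is not how bedtools is implemented, and the interval names are made up). On real files the equivalent call would be something like `bedtools intersect -a dmrs.bed -b master_features.bed -wa -wb`, where the file names are hypothetical and the options used in the notebook may differ.

```python
def intersect(dmrs, features):
    """Return (dmr_name, feature_name) pairs that share at least 1 bp.

    Intervals are (chrom, start, end, name) with 0-based, half-open coordinates.
    """
    hits = []
    for d_chrom, d_start, d_end, d_name in dmrs:
        for f_chrom, f_start, f_end, f_name in features:
            # Two half-open intervals overlap when each starts before the other ends.
            if d_chrom == f_chrom and d_start < f_end and f_start < d_end:
                hits.append((d_name, f_name))
    return hits


# Toy intervals (hypothetical)
dmrs = [
    ("scaffold_1", 120, 180, "DMR_1"),
    ("scaffold_1", 950, 1000, "DMR_2"),
]
features = [
    ("scaffold_1", 100, 250, "exon"),
    ("scaffold_1", 100, 500, "gene"),
    ("scaffold_1", 600, 900, "intergenic"),
]

hits = intersect(dmrs, features)
print(hits)
```

One DMR can match several features (here DMR_1 falls in both an exon and its gene), which is worth keeping in mind when counting DMRs per feature category. The nested loop is only suitable for toy data.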
Generate background regions
- find all CpGs covered in at least 3/4 of the samples in each treatment group
- then find CpGs common to all groups within each comparison (all ambient samples, all day 10 samples, etc.)
- bin features in the master genome feature bed file into 2 kb windows
  - the purpose is to make background regions more comparable in size to DMRs: DMRs range from 6 bp to ~2 kb, whereas some features are much larger (e.g. intergenic regions can be > 10 kb).
- determine binned features that overlap with covered CpGs
- filter features for those with at least 3 covered CpGs
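
The background-region steps above can be sketched as follows. This is a toy, in-memory version under stated assumptions: CpGs are represented as (scaffold, position) pairs, each sample is a list of the CpGs that passed coverage in that sample, and the group labels, function names, and data are all hypothetical. The real analysis would run on bed files rather than Python lists.

```python
from collections import Counter


def covered_in_group(sample_cpgs, min_fraction=0.75):
    """CpGs present in at least min_fraction of the samples in one group."""
    counts = Counter(pos for cpgs in sample_cpgs for pos in set(cpgs))
    needed = min_fraction * len(sample_cpgs)
    return {pos for pos, n in counts.items() if n >= needed}


def common_cpgs(groups):
    """CpGs that pass the coverage filter in every group of a comparison."""
    passing = [covered_in_group(samples) for samples in groups.values()]
    return set.intersection(*passing)


def bin_feature(chrom, start, end, name, bin_size=2000):
    """Split one feature into consecutive windows of at most bin_size bp."""
    return [(chrom, s, min(s + bin_size, end), name)
            for s in range(start, end, bin_size)]


def background_regions(features, cpgs, bin_size=2000, min_cpgs=3):
    """Binned features that contain at least min_cpgs covered CpGs."""
    kept = []
    for feat in features:
        for chrom, s, e, name in bin_feature(*feat, bin_size=bin_size):
            n = sum(1 for c, pos in cpgs if c == chrom and s <= pos < e)
            if n >= min_cpgs:
                kept.append((chrom, s, e, name))
    return kept


def on_s1(*positions):
    """Helper for the toy data: put positions on a hypothetical scaffold."""
    return [("scaffold_1", p) for p in positions]


# Two hypothetical groups of 4 samples each
groups = {
    "ambient": [on_s1(10, 50, 90, 2500), on_s1(10, 50, 90),
                on_s1(10, 50, 90, 2500), on_s1(10, 2500)],
    "treated": [on_s1(10, 50, 90), on_s1(10, 50, 90),
                on_s1(10, 50, 90), on_s1(10)],
}
features = [("scaffold_1", 0, 5000, "intergenic")]

shared = common_cpgs(groups)
background = background_regions(features, shared)
print(sorted(shared))
print(background)
```

In this toy case the CpG at position 2500 passes in one group but not the other, so it is dropped, and only the first 2 kb window of the 5 kb feature has the 3 covered CpGs needed to count as background.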
Determine whether the number of DMRs in each feature category differs significantly from the number of covered (background) regions in that category
- read data into R
- create contingency tables for each comparison
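
As an illustration of the test applied to each contingency table, here is a sketch of a Pearson chi-square test on a single 2x2 table (feature vs. not feature, DMR vs. background). The counts are invented for the example and are not results from this analysis, and the function name is hypothetical. Note that R's `chisq.test` applies a continuity correction to 2x2 tables by default, so its output will differ from this uncorrected version unless `correct = FALSE` is set.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, observed in enumerate(obs_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up counts: rows = DMRs, background regions; columns = in exon, not in exon
table = [[20, 80],
         [100, 900]]

stat, p_value = chi_square_2x2(table)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```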

20200403_feature_analysis.ipynb: Jupyter notebook that performs pipeline steps 1-3 (building the master feature file, intersecting DMRs with features, and generating background regions)
20200406_feat_stats.Rmd: R Markdown script that performs the chi-square statistical analysis and visualization of genomic feature overlap

Reduce the number of features
- remove rRNA since no data overlap with these
- consolidate exon, CDS, gene, mRNA, etc.
Add new features
- UTR
- putative promoter
Re-run analyses with updated features
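
One possible way to add a putative promoter feature is to take a fixed window immediately upstream of each gene, respecting strand. The sketch below assumes a 1 kb window, which is a placeholder and not a value decided in these notes; the function name and coordinates are also hypothetical.

```python
def putative_promoter(chrom, start, end, strand, upstream=1000):
    """Window of `upstream` bp immediately 5' of a gene (0-based, half-open).

    The 1000 bp default is an arbitrary placeholder, not a chosen parameter.
    """
    if strand == "+":
        # Clip at the scaffold start so coordinates never go negative.
        return (chrom, max(0, start - upstream), start, "putative_promoter")
    # On the minus strand the 5' end of the gene is its larger coordinate.
    # This sketch does not clip at the scaffold end; that needs scaffold lengths.
    return (chrom, end, end + upstream, "putative_promoter")


plus_promoter = putative_promoter("scaffold_1", 1500, 3000, "+")
minus_promoter = putative_promoter("scaffold_1", 1500, 3000, "-")
print(plus_promoter)
print(minus_promoter)
```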