stats

attempted data transformations to see if they would make the data closer to normally distributed

  • log
  • logit
  • arcsin(sqrt) transform and ANOVA
  • still zero-inflated
  • best option seems to be binomial glm
    • to do this I need to generate counts of methylated Cs and unmethylated Cs
    • save intermediate files in methylpy DMRfind
    • concatenate all .tsv files
      • these have already combined the counts from neighboring sites
    • intersect with filtered bed file
    • sum mCs and sum un-mCs for each DMR in each sample
      • most numbers match, but some disagree. I don’t understand why. Contacted Yupeng to see if there is a better way to get these data
    • Turns out a binomial glm can be run on proportion data (http://www.flutterbys.com.au/stats/tut/tut10.5a.html)
    • but it still needs integer counts (or proportions weighted by the total counts), not bare continuous values
  • For now, going with arcsin(sqrt) transformation and ANOVA, uncorrected p-value cutoff of 0.1
    • liberal cutoff because the model is a little unfair towards the data, since the data are not exactly normal (still zero-inflated)
    • selected DMRs still seem legit and show group differences
    • ANOVA on arcsin(sqrt)-transformed data at p-value < 0.1 is more stringent than ANOVA on untransformed data, so that’s good
  • Would still be better to use a glm but I’m waiting to see what Yupeng says since it’s not easy for me to regenerate that info.
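
A minimal sketch of the counting and transformation steps in the notes above, using only the Python standard library. The input layout (one tuple per site with a DMR id, sample name, methylated count, and total count) is a hypothetical stand-in, not the actual methylpy intermediate format, and the function names are made up for illustration. It sums mCs and un-mCs per DMR per sample, applies the arcsin(sqrt) transform to the resulting proportion, and computes a one-way ANOVA F statistic (the p-value would need an F distribution from a stats library, which is left out here).

```python
import math
from collections import defaultdict


def sum_counts(rows):
    """Sum methylated and unmethylated counts per (DMR, sample).

    rows: iterable of (dmr_id, sample, mc, total) tuples.
    This layout is an assumption for illustration only.
    """
    totals = defaultdict(lambda: [0, 0])
    for dmr_id, sample, mc, total in rows:
        totals[(dmr_id, sample)][0] += mc          # methylated Cs
        totals[(dmr_id, sample)][1] += total - mc  # unmethylated Cs
    return {key: tuple(val) for key, val in totals.items()}


def arcsin_sqrt(mc, unmc):
    """arcsin(sqrt(p)) of the methylation proportion; None if no coverage."""
    n = mc + unmc
    if n == 0:
        return None
    return math.asin(math.sqrt(mc / n))


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy example with made-up numbers
rows = [
    ("dmr1", "s1", 3, 10),
    ("dmr1", "s1", 2, 10),
    ("dmr1", "s2", 0, 5),
]
counts = sum_counts(rows)
transformed = {key: arcsin_sqrt(*val) for key, val in counts.items()}
```

The integer (mC, un-mC) pairs from `sum_counts` are also the form a binomial glm expects, so the same table could feed either approach once the count discrepancies are sorted out.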