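# Recommended Introductory Course on Statistics with R: Data Analysis for the Life Sciences

Data Analysis for the Life Sciences is part of Harvard's PH525x series, Biomedical Data Science. The entire course uses R for both statistical theory and hands-on analysis. The textbook is written in R Markdown, which keeps it easy to read while guaranteeing that every analysis is reproducible, in line with the most demanding reproducibility standards in science. Readers can systematically learn the statistics a biologist needs, implement each method in code themselves, and meet the reproducible-code requirements for publishing in Cell, Nature, and Science (CNS).

Rafael A Irizarry is a professor of biostatistics and computational biology at the Dana-Farber Cancer Institute and the Harvard School of Public Health, with 17 years of experience analyzing genomic data.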

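Michael I Love is an assistant professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. His research uses statistical models to uncover biological patterns in genomic data, and he develops open-source statistical software for Bioconductor.

## Course Outline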

https://genomicsclass.github.io/book/

### PH525x series - Biomedical Data Science

• R markdown source files
• ePub version on Leanpub
• Links to the HarvardX class pages
• External resources and books
• Finding more help for data analysis

### Chapter 0 - Introduction

• Introduction [Rmd]
• Getting started [Rmd]
• Getting started exercises
• dplyr introduction (data manipulation) [Rmd]
• dplyr introduction exercises
• Mathematical notation [Rmd]

### Chapter 1 - Inference

• Random variables [Rmd]
• Random variables exercises
• Populations and samples [Rmd]
• Populations and samples exercises
• CLT and t-distribution [Rmd]
• CLT and t-distribution exercises
• CLT in practice [Rmd]
• CLT in practice exercises
• t-test in practice [Rmd]
• Confidence intervals [Rmd]
• Power calculations [Rmd]
• Power calculations exercises
• Monte Carlo [Rmd]
• Monte Carlo exercises
• Permutation tests [Rmd]
• Permutation tests exercises
• Association tests [Rmd]
• Association tests exercises

### Chapter 2 - Exploratory Data Analysis

• Exploratory data analysis [Rmd]
• Plots to avoid [Rmd]
• Exploratory data analysis exercises

### Chapter 3 - Robust Statistics

• Robust summaries [Rmd]
• Rank tests [Rmd]
• Robust summaries exercises

### Chapter 4 - Matrix Algebra

• Introduction to using regression [Rmd]
• Introduction to using regression exercises
• Matrix notation [Rmd]
• Matrix notation exercises
• Matrix operations [Rmd]
• Matrix operations exercises
• Matrix algebra examples [Rmd]
• Matrix algebra examples exercises

### Chapter 5 - Linear Models

• Linear models introduction [Rmd]
• Linear models introduction exercises
• Expressing design formula [Rmd]
• Expressing design formula exercises
• Linear models in practice [Rmd]
• Linear models in practice exercises
• Standard errors [Rmd]
• Standard errors exercises
• Interactions and contrasts [Rmd]
• Interactions and contrasts exercises
• Collinearity [Rmd]
• Collinearity exercises
• QR and regression [Rmd]
• Linear models going further [Rmd]

### Chapter 6 - Inference for High-Dimensional Data

• Introduction to high-throughput data [Rmd]
• Introduction to high-throughput data exercises
• Inference for high-throughput data [Rmd]
• Inference for high-throughput data exercises
• Multiple testing [Rmd]
• Multiple testing exercises
• EDA for high-throughput data [Rmd]
• EDA for high-throughput data exercises

### Chapter 7 - Statistical Modeling

• Modeling [Rmd]
• Modeling exercises
• Bayes theorem [Rmd]
• Bayes theorem exercises
• Hierarchical models [Rmd]
• Hierarchical models exercises

### Chapter 8 - Distance and Dimension Reduction

• Distance [Rmd]
• Distance exercises
• PCA motivation [Rmd]
• SVD [Rmd]
• SVD exercises
• Projections [Rmd]
• Rotations [Rmd]
• MDS [Rmd]
• MDS exercises
• PCA [Rmd]

### Chapter 9 - Practical Machine Learning

• Clustering and heatmaps [Rmd]
• Clustering and heatmaps exercises
• Conditional expectation [Rmd]
• Conditional expectation exercises
• Smoothing [Rmd]
• Smoothing exercises
• Machine learning [Rmd]
• Crossvalidation [Rmd]
• Crossvalidation exercises

### Chapter 10 - Batch Effects

• Introduction to batch effects [Rmd]
• Confounding [Rmd]
• Confounding exercises
• EDA with PCA [Rmd]
• EDA with PCA exercises
• Adjusting with linear models [Rmd]
• Adjusting with linear models exercises
• Factor analysis [Rmd]
• Factor analysis exercises
• Adjusting with factor analysis [Rmd]
• Adjusting with factor analysis exercises

### Chapter 11 - Introduction to Bioconductor

• Mike Love’s general reference card
• Motivations and core values (optional)
• Installing Bioconductor and finding help [Rmd]
• Data structure and management for genome scale experiments [Rmd]
• Coordinating multiple tables: ExpressionSet
• Institutional archives: GEO, ArrayExpress
• Interlude: Working with general genomic features using GenomicRanges
• IRanges introduced
• Intra-range operations
• Inter-range operations
• GRanges
• Calculating overlaps
• Range-oriented solutions for current experimental paradigms
• SummarizedExperiment: for RNA-seq and 450k methylation
• External storage for very large assays
• GenomicFiles for families of BAM or BED
• DNA Variants: VCF handling with VariantAnnotation and VariantTools
• Handling multiomic archives like TCGA
• Cloud-oriented solutions: e.g., Google BigQuery
• Short read mapping/alignment software (optional) [Rmd]

### Chapter 12 - Genomic Annotation with Bioconductor

• More details on GRanges [Rmd]
• Run-length encoding, views
• Application to genomic landmarks
• Application to 450k methylation array visualization
• General overview of Bioconductor annotation [Rmd]
• Levels: reference sequence, regions of interest, pathways
• Discovering reference sequence
• A build of the human genome
• Gene/Transcript/Exon catalogs from UCSC and Ensembl
• Importing and exporting regions and scores
• AnnotationHub: brokering thousands of annotation resources
• OrgDb: simple interface to annotation databases
• Finding and managing gene sets
• OrganismDb: unifying diverse annotation
• Cheat sheet on Bioconductor annotation [Rmd]
• Translating addresses between genome builds: liftOver [Rmd]

### Chapter 13 - Genome-scale hypothesis testing with Bioconductor

• Distinguishing biological and technical variability [Rmd]
• An experiment with pooled and individual samples
• Measuring technical variation
• Measuring biological variation
• Interpretation
• Multiple comparisons with genewise t-tests [Rmd]
• Gene-wise testing
• Naive enumeration of genes
• Demonstrating danger of multiple testing with a set of sham comparisons
• Adjusting for multiplicity with qvalue
• Adjusted counts in the sham case
• Moderated t tests via limma [Rmd]
• A spike-in dataset
• Naive t-tests
• Three steps with limma: lmFit, eBayes, topTable
• Exposing the spiked-in genes
• A view of the shrinkage of variance estimates
• Introducing gene sets and gene set analysis [Rmd]
• Data wrangling
• A dataset for comparing expression by gender
• Finding surrogate variables/batch effect correction
• Identifier remapping
• Categorical testing
• Statistical summaries for sets: Wilcoxon
• Statistical summaries for sets: t statistics
• A permutation procedure

### Chapter 14 - Visualization of genome scale data

• A basic overview of visualization tasks and strategies [Rmd]
• Gene models
• Gene models plus data
• Driving visualizations with functions
• Using the browser to drive visualization functions via shiny
• Queriable dynamic displays with plotly
• Annotation-oriented visualizations
• Sketching the binding landscape over chromosomes with ggbio’s karyogram layout [Rmd]
• Plotting data in the context of genomic features with Gviz [Rmd]
• Visualizing NGS data [Rmd]
• Interactive visualization
• Graphical user interfaces for multivariate data with shiny [Rmd]
• Clustering gene expression data with shiny [Rmd]
• Final remarks on visualization [Rmd]

### Chapter 15 - Pursuing scalability in genomic analysis: parallelism and out-of-memory data

• Parallel computing with R and Bioconductor [Rmd]
• Demonstrating simple speedup in multicore environments
• Implicit parallelism with BiocParallel and GenomicAlignments
• External data: data interfaces that spare RAM [Rmd]
• SQLite for annotation
• Tabix-indexed BAM
• HDF5
• An illustration of NoSQL with S4: mongodb and RaggedMongoExpt [Rmd]
• Benchmarking various out-of-memory solutions [Rmd]
• Introduction to Bioconductor’s Amazon Machine Instance for cluster creation and use in EC2 [Rmd]
• Sharded GRanges for scalable integrative analysis [Rmd]

### Chapter 16 - Multi-omic data integration

• Basic examples of multi-omic integration [Rmd]
• Transcription factor (TF) binding and gene coexpression in yeast
• TF binding and GWAS hits in humans
• Using RTCGAToolbox outputs to integrate clinical, mutation, expression and methylation assays [Rmd]
• Basic data acquisition
• Working with clinical data
• Defining a severity marker
• Extracting survival times
• Working with mutations
• Curation tasks for discrepant identifier formats
• Working with expression data
• Associating tumor stage with expression patterns
• Linking DNA methylation with expression patterns
• Application to visualization: kataegis and rainfall plot [Rmd]

### Chapter 17 - Fostering reproducible genome-scale analysis

• Overview of unit on reproducibility [Rmd]
• Basic definitions
• Infrastructure requirements
• Statistical aspects of reproducibility
• Analysis of reproducibility probability (Boos and Stefanski 2011)
• Costs of highly reproducible designs
• Package structure, creation, installation, management [Rmd]
• What is a package?
• Using package.skeleton
• Using makeOrganismPackage
• Using devtools
• create() to set up folders and DESCRIPTION
• Composing documentation plus code
• document(), install()
• Conclusions, including a link to a recent Nature Toolbox article on Bioconductor

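## How to Study

Work through https://genomicsclass.github.io/book/ section by section. There is a lot of material, so readers can simply pick the chapters that suit them.

On Linux, download the source files with wget or git: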

    # Method 1: unzipping produces the labs-master directory
    wget -c https://github.com/genomicsclass/labs/archive/master.zip
    unzip master.zip

    # Method 2: clones into the labs directory (the HTTPS URL needs no SSH key)
    git clone https://github.com/genomicsclass/labs.git
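
The two download methods can also be wrapped in a small script that uses whichever tool is installed. This is only a sketch and not part of the course materials: the function names `pick_method` and `fetch_labs` are hypothetical, and preferring git over wget, cloning over HTTPS, and doing a shallow clone with `--depth 1` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: fetch the course labs with git if present, otherwise wget + unzip.
# Function names and the git-first preference are illustrative assumptions.

REPO_URL="https://github.com/genomicsclass/labs.git"
ZIP_URL="https://github.com/genomicsclass/labs/archive/master.zip"

# Print the usable download method on this machine: git, wget, or none.
pick_method() {
    if command -v git >/dev/null 2>&1; then
        echo git
    elif command -v wget >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
        echo wget
    else
        echo none
    fi
}

# Download the labs into the current directory using the chosen method.
fetch_labs() {
    case "$(pick_method)" in
        git)  git clone --depth 1 "$REPO_URL" labs ;;        # creates ./labs
        wget) wget -c "$ZIP_URL" && unzip master.zip ;;      # creates ./labs-master
        *)    echo "install git, or wget and unzip, first" >&2
              return 1 ;;
    esac
}
```

Sourcing the script only defines the functions and downloads nothing; run `fetch_labs` from the directory where the materials should be stored.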

## Afterword

https://mp.weixin.qq.com/s/5jQspEvH5_4Xmart22gjMA

http://wap.sciencenet.cn/blog-3334560-1131943.html
