上一篇,介绍了一下显著性的SNP,他们的解释表型变异百分比(PVE)之和,为何可能大于1.
https://yijiaobani.blog.csdn.net/article/details/122093602
这篇我们介绍一下GLM模型中,PVE的计算方法。
1. 数据描述
协变量是PCA的前三个,数据具体如下:
「表型数据:」
1641 1641 153.973423 1642 1642 113.119301 1643 1643 77.094801 1644 1644 89.293866 1645 1645 94.511433 1646 1646 134.410909 1647 1647 121.246759 1648 1648 45.699443 1649 1649 67.786229 1650 1650 102.225648
「协变量数据」
1641 1641 0.0633225 0.0285328 -0.00734127 1642 1642 0.0684039 0.0212648 -0.000664363 1643 1643 0.0609595 0.0242615 -0.00206211 1644 1644 0.0636631 0.0241012 -0.00275062 1645 1645 0.0741456 0.0293644 0.00068775 1646 1646 0.114417 0.0443451 -0.0408687 1647 1647 0.111599 0.0400401 -0.0320522 1648 1648 0.100862 0.0344925 -0.0386237 1649 1649 0.107986 0.028411 -0.0312307 1650 1650 0.108836 0.0377951 -0.0314308
「基因型数据:」
FID IID PAT MAT SEX PHENOTYPE M4_A M6_T M9_T M11_A M17_A M19_A M22_A M24_A M27_A M31_ 1641 1641 0 0 0 -9 0 1 2 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1642 1642 0 0 0 -9 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1643 1643 0 0 0 -9 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1644 1644 0 0 0 -9 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1645 1645 0 0 0 -9 0 1 2 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1646 1646 0 0 0 -9 1 0 0 0 0 0 0 2 2 2 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1647 1647 0 0 0 -9 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1648 1648 0 0 0 -9 1 0 0 0 0 0 0 1 1 2 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1649 1649 0 0 0 -9 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1650 1650 0 0 0 -9 1 0 0 0 0 0 0 1 1 2 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1651 1651 0 0 0 -9 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1652 1652 0 0 0 -9 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1653 1653 0 0 0 -9 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1654 1654 0 0 0 -9 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1655 1655 0 0 0 -9 2 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1656 1656 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1657 1657 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1658 1658 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2. GAPIT软件分析GLM
模型
GAPIT软件安装,见:如何安装GAPIT软件:https://zhuanlan.zhihu.com/p/268327005
「数据及代码下载,请关注公众号:育种数据分析之放飞自我,进入知识星球进行相关下载和学习」
「分析代码如下:」
library(data.table) raw = fread("plink.raw",header=T) raw[1:10,1:10] map = fread("recode.map",header=F) head(map) rr = raw[,!c(1,3:6)] names(rr) = c("taxa",map$V2) mm = map[,c(2,1,4)] names(mm) = c("Name","Chromosome","Position") head(mm) rr[1:10,1:10] fwrite(rr,"mdp_numeric.txt",sep="\t") fwrite(mm,"mdp_SNP_information.txt",sep="\t") # # GWAS 分析 library(data.table) source("http://zzlab.net/GAPIT/GAPIT.library.R") source("http://zzlab.net/GAPIT/gapit_functions.txt") myGd = fread("mdp_numeric.txt",header=T,data.table = F) myGM = fread("mdp_SNP_information.txt",header = T,data.table=F) myY = fread("dat_plink.txt",data.table = F) head(myY) covar = fread("cov_plink.txt",data.table = F)[,-1] names(covar)[1] = "Taxa" head(covar) myGAPIT = GAPIT(Y = myY[,c(1,3)],GD = myGd, GM = myGM, # PCA.total=3, CV = covar, model="GLM")
「注意:」
这里使用的是plink格式的表型和PCA结果,使用的是plink.raw文件为基因型数据 将其转化为gapit软件需要的格式 定义基因型和位置信息,定义表型,定义协变量,定义模型为GLM 结果文件为: GAPIT.GLM.V3.GWAS.Results.csv
3. GLM模型结果文件
结果文件如下: 包括:
SNP名称 染色体 物理位置 P值 # 重点关注 maf Rsquare.of.Model.without.SNP # 重点关注 Rsquare.of.Model.with.SNP # 重点关注
SNP,Chromosome,Position ,P.value,maf,nobs,Rsquare.of.Model.without.SNP,Rsquare.of.Model.with.SNP,FDR_Adjusted_P-values,effect M57157,12,44,0.0000000265778189052299,0.375,1000,0.0178325320026741,0.0488405467802565,0.000799726570858368,7.5184135943861 M57155,12,44,0.000000149990719134015,0.3275,1000,0.0178325320026741,0.0454335956044116,0.00115674059672426,7.24232622801506 M57156,12,44,0.000000149990719134015,0.3275,1000,0.0178325320026741,0.0454335956044116,0.00115674059672426,7.24232622801506 M98663,20,73,0.00000015377076726145,0.222,1000,0.0178325320026741,0.0453847644903734,0.00115674059672426,-7.81213756957444 M73233,15,64,0.000000512284630336812,0.2255,1000,0.0178325320026741,0.0430299031398552,0.00290006393781271,-7.52596405927775 M8452,2,69,0.000000578277953701437,0.435,1000,0.0178325320026741,0.0427934752525868,0.00290006393781271,-6.51962588787404 M45998,10,20,0.000000768626039613293,0.411,1000,0.0178325320026741,0.0422387923854602,0.00307979999092427,-6.69120649416338 M82957,17,57,0.00000112940850557263,0.1685,1000,0.0178325320026741,0.0414897712489706,0.00307979999092427,8.48568078650835 M82958,17,57,0.00000112940850557263,0.1685,1000,0.0178325320026741,0.0414897712489706,0.00307979999092427,8.48568078650835
这里,
P值 为SNP的P值,我们根据它判断显著性,并根据它进行QQ图和曼哈顿图的绘制 Rsquare.of.Model.without.SNP # 这个是单位点不包括次SNP的解释百分比(R方) Rsquare.of.Model.with.SNP # 这个是单位点包括此SNP的解释百分比(R方)
「上面两者之差,即为该SNP的解释百分比(PVE)」
$$SNP的PVE = Rsquare.of.Model.with.SNP - Rsquare.of.Model.without.SNP$$
4. 我们将结果读入R语言计算
首先读入R语言:
r$> re = fread("GAPIT.GLM.V3.GWAS.Results.csv") r$> head(re) SNP Chromosome Position P.value maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values effect 1: M57157 12 44 2.657782e-08 0.3750 1000 0.01783253 0.04884055 0.0007997266 7.518414 2: M57155 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326 3: M57156 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326 4: M98663 20 73 1.537708e-07 0.2220 1000 0.01783253 0.04538476 0.0011567406 -7.812138 5: M73233 15 64 5.122846e-07 0.2255 1000 0.01783253 0.04302990 0.0029000639 -7.525964 6: M8452 2 69 5.782780e-07 0.4350 1000 0.01783253 0.04279348 0.0029000639 -6.519626
然后我们根据上面的公式,手动计算PVE值:
r$> re$PVE = re$Rsquare.of.Model.with.SNP - re$Rsquare.of.Model.without.SNP r$> head(re) SNP Chromosome Position P.value maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values effect PVE 1: M57157 12 44 2.657782e-08 0.3750 1000 0.01783253 0.04884055 0.0007997266 7.518414 0.03100801 2: M57155 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326 0.02760106 3: M57156 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326 0.02760106 4: M98663 20 73 1.537708e-07 0.2220 1000 0.01783253 0.04538476 0.0011567406 -7.812138 0.02755223 5: M73233 15 64 5.122846e-07 0.2255 1000 0.01783253 0.04302990 0.0029000639 -7.525964 0.02519737 6: M8452 2 69 5.782780e-07 0.4350 1000 0.01783253 0.04279348 0.0029000639 -6.519626 0.02496094
根据PVE的大小,进行排序,降序:
r$> re %>% arrange(-PVE) SNP Chromosome Position P.value maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values effect PVE 1: M57157 12 44 2.657782e-08 0.3750 1000 0.01783253 0.04884055 0.0007997266 7.5184135944 3.100801e-02 2: M57155 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.2423262280 2.760106e-02 3: M57156 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.2423262280 2.760106e-02 4: M98663 20 73 1.537708e-07 0.2220 1000 0.01783253 0.04538476 0.0011567406 -7.8121375696 2.755223e-02 5: M73233 15 64 5.122846e-07 0.2255 1000 0.01783253 0.04302990 0.0029000639 -7.5259640593 2.519737e-02 --- 30086: M58296 12 66 9.996606e-01 0.2700 1000 0.01783253 0.01783253 0.9997934933 0.0006040638 1.785355e-10 30087: M751 1 15 9.998043e-01 0.2435 1000 0.01783253 0.01783253 0.9998960441 0.0003736382 5.935820e-11 30088: M21736 5 35 9.998296e-01 0.3020 1000 0.01783253 0.01783253 0.9998960441 0.0002955665 4.500680e-11 30089: M80677 17 12 9.999131e-01 0.3830 1000 0.01783253 0.01783253 0.9999300138 -0.0001448323 1.170910e-11 30090: M57593 12 52 9.999300e-01 0.2735 1000 0.01783253 0.01783253 0.9999300138 -0.0001300807 7.589301e-12
可以看到,结果就给出了PVE从大到小的排序结果。
这里,我们注意到,前三个SNP的PVE分别是:
3.1% 2.76% 2.76% 而且,这三个SNP都在12号染色体的44个位置,因为这三个位点离得很近,都达到显著水平,PVE也很大,那我们可以认为附件有一个基因,显著影响表型。
这里,我们对PVE进行求和:
r$> sum(re$PVE) [1] 57.6114
可以看到,总的PVE之和为57.611,远远大于1,是因为有些位点高度LD,所以PVE之和就很大。相关问题在 GWAS分析中SNP解释百分比PVE | 第一篇,SNP解释百分比之和为何大于1?中有过介绍。
5. 用R语言如何计算?
简单来说,就是单位点的回归分析,计算R方。
这里,我们用同样的数据,在R中进行GLM的GWAS分析。
代码如下:
library(data.table) geno = fread("plink.raw")[,!c(1,3:6)] map = fread("recode.map") m012 = geno names(m012) = c("IID",map$V2) phe = fread("phe.txt") cov1= fread("tcov.txt") dat = left_join(phe,cov1,by = c("V1"="V1")) %>% dplyr::select(IID =1, y =2, pc1=3,pc2=4,pc3=5) dat1 = left_join(dat,m012,by="IID") # dat1$M57157 = as.factor(dat1$M57157) mod_1 = lm(y ~ 1+pc1 + pc2 + pc3 + M57157,data=dat1);summary(mod_1) summary(mod_1)$r.squared mod_2 = lm(y ~ 1+pc1 + pc2 + pc3 ,data=dat1);summary(mod_2) summary(mod_2)$r.squared
这里,我们将PCA的前三个作为协变量加到回归分析汇总。
包含SNP的回归分析:
r$> mod_1 = lm(y ~ 1+pc1 + pc2 + pc3 + M57157,data=dat1);summary(mod_1) Call: lm(formula = y ~ 1 + pc1 + pc2 + pc3 + M57157, data = dat1) Residuals: Min 1Q Median 3Q Max -85.094 -19.174 -0.443 18.218 79.181 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 96.408 1.320 73.048 < 2e-16 *** pc1 -108.543 27.603 -3.932 9.00e-05 *** pc2 28.740 28.981 0.992 0.3216 pc3 58.098 27.890 2.083 0.0375 * M57157 7.518 1.320 5.695 1.62e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 27.6 on 995 degrees of freedom Multiple R-squared: 0.04884, Adjusted R-squared: 0.04502 F-statistic: 12.77 on 4 and 995 DF, p-value: 3.838e-10
可以看到,R方为0.04884,也可以这样提取:
r$> summary(mod_1)$r.squared [1] 0.04884055
不包含SNP的回归分析:
r$> mod_2 = lm(y ~ 1+pc1 + pc2 + pc3 ,data=dat1);summary(mod_2) Call: lm(formula = y ~ 1 + pc1 + pc2 + pc3, data = dat1) Residuals: Min 1Q Median 3Q Max -89.495 -19.318 -0.099 18.649 85.448 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 102.0467 0.8864 115.129 < 2e-16 *** pc1 -111.8205 28.0295 -3.989 7.11e-05 *** pc2 -21.6604 28.0295 -0.773 0.44 pc3 35.1350 28.0295 1.254 0.21 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 28.03 on 996 degrees of freedom Multiple R-squared: 0.01783, Adjusted R-squared: 0.01487 F-statistic: 6.028 on 3 and 996 DF, p-value: 0.0004546
这里的R方为:0.01783
也可以这样提取:
r$> x2 = summary(mod_2)$r.squared summary(mod_2)$r.squared [1] 0.01783253
那么SNP:M57157的PVE为:3.1%
r$> x1 - x2 [1] 0.03100801
对比我们GAPIT的PVE结果:
结果完全一致。
这里,一般线性模型中,可以针对显著性的SNP,进行单位点回归分析,计算PVE。对于混合线性模型,也可以将显著性位点提取,进行R语言的手动计算,这个也是PVE计算的一种方法。
混合线性模型中,还有其它的计算方法,我们后面进行介绍,欢迎继续关注我。
6. 数据和代码下载方法
「数据及代码下载,请关注公众号:育种数据分析之放飞自我,进入知识星球进行相关下载和学习」
转载本文请联系原作者获取授权,同时请注明本文来自邓飞科学网博客。
链接地址:https://wap.sciencenet.cn/blog-2577109-1317741.html?mobile=1
收藏