谢钢
A few thoughts on the future direction of statistical data analysis
2022-1-17 18:06

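Every month our research office sends a newsletter to all research centres and teaching departments across the university, and each time I contribute a small, 'tofu-cube'-sized item related to statistical analysis to share with everyone. Below is a short commentary on the connection between 'machine learning' and 'statistical analysis models'. The original English text is as follows: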

The connection between Machine Learning (ML) and statistical data analysis models

Machine Learning refers to those algorithms used in computer software for analytical model building to find hidden insights in data without being explicitly programmed where to look or what to conclude. In multivariate data analysis, statistical models such as principal components analysis (PCA) and factor analysis (for a group of continuous variables), or correspondence analysis (CA) and multiple correspondence analysis (MCA) (for a group of categorical variables), are employed to identify hidden structure in the data. From the machine learning perspective, these models are labelled as unsupervised ML algorithms, that is, methods that look for structure without referring to a target response variable. On the other hand, we may apply a regression tree model (for a continuous response variable versus a group of predictor variables) or a classification tree model (for a categorical response variable) to determine the hidden patterns between the response variable and the predictor variables. Tree models in statistical data analysis are labelled as supervised ML algorithms in artificial intelligence (AI). By combining the bootstrapping technique (i.e., resampling with replacement) with tree models, new data analysis models such as random forests were developed. Statisticians call these new types of statistical models statistical learning algorithms. Of course, at the same time AI people would consider the random forest a standard ML algorithm. In a sense, if we automate the model selection process for a linear regression model or a logistic regression model, say by using AIC for model selection, such statistical practice may be considered an application of ML as well.

Then we have probabilistic network models such as neural network and Bayesian network models. Deep learning, in the simplest terms, refers to the application of a multi-layer neural network model for learning from data. People who identify themselves as AI experts tend to use multi-layer neural network models more often, while people who identify themselves as statisticians tend to use Bayesian network models more, for different purposes in data analysis. It is argued in the literature that the future of big data (or ordinary data) analysis is likely to move in the direction of Bayesian Deep Learning, namely, a framework with two seamlessly integrated components: a perception component handled by the neural network algorithm and a task-specific component handled by the Bayesian network algorithm.

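Broadly, these are the impressions and views I wanted to convey. What we call 'principal components analysis' and '(multiple) correspondence analysis' in statistical modelling becomes, in the vocabulary of machine learning people, 'unsupervised machine learning algorithms' (learning with no response variable to guide it), while 'regression tree' or 'classification tree' models are 'supervised machine learning algorithms' (learning guided by a response variable). Even pairing an ordinary or generalized linear regression model with AIC (the Akaike information criterion) to 'automatically identify' the best model can be regarded as a kind of supervised machine learning algorithm. Of course, all the models/algorithms mentioned above are judged by how accurately they predict new data points, which is an entirely different line of thinking from the evaluation criteria of statistical hypothesis testing.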
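To make the 'AIC as automatic model selection' remark concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from the newsletter: the simulated data, the candidate predictor names `x1` and `x2`, and the helper functions. It fits an intercept-only model and one simple linear regression per candidate predictor, then keeps whichever model has the lowest AIC (computed for Gaussian errors, up to an additive constant).

```python
import math
import random


def fit_simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sum of squares)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss


def gaussian_aic(rss, n, k):
    """AIC up to an additive constant for a Gaussian linear model.

    k counts the mean parameters only; the error variance would add the
    same amount to every candidate, so it does not affect the ranking.
    """
    return n * math.log(rss / n) + 2 * k


def select_by_aic(predictors, y):
    """Score the intercept-only model and each one-predictor model by AIC."""
    n = len(y)
    mean_y = sum(y) / n
    rss_null = sum((yi - mean_y) ** 2 for yi in y)
    scores = {"(intercept only)": gaussian_aic(rss_null, n, 1)}
    for name, x in predictors.items():
        _, _, rss = fit_simple_ols(x, y)
        scores[name] = gaussian_aic(rss, n, 2)
    best = min(scores, key=scores.get)
    return best, scores


# Simulated data (an assumption for illustration): y depends on x1 only.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [rng.gauss(0, 1) for _ in range(200)]
y = [2.0 + 3.0 * v + rng.gauss(0, 1) for v in x1]

best, scores = select_by_aic({"x1": x1, "x2": x2}, y)
print("selected model:", best)
```

Nobody tells the procedure which predictor matters; the criterion picks it out, which is the sense in which this routine statistical practice resembles supervised machine learning. A fuller treatment would search over subsets of predictors and would also check predictive accuracy on held-out data.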

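When we combine computer science with traditional statistical models (for example, applying a bit of 'resampling with replacement', i.e. the bootstrap, literally 'pulling oneself up by one's bootstraps', to tree models), we get things like 'random forests'. Some statisticians call these models 'statistical learning algorithms'. One step further brings us to 'neural network' and 'Bayesian network' models. The so-called 'deep learning' models of artificial intelligence are simply 'neural network' models with a multi-layer structure. A recent survey of machine learning [1] puts forward the following view, with which I very much agree: the most promising direction for the future of data science is to explore combining 'deep learning networks' with 'Bayesian networks'. The reason is that, in the era of big data, 'neural network' models are particularly well suited to taking in data and learning its structural patterns, whereas 'Bayesian networks' are better suited to modelling from the standpoint of subject-matter knowledge, and their results are easier to interpret in terms of the research question at hand. So the two are seamlessly combined, with the 'neural network' part forming the 'perception component' and the 'Bayesian network' forming the 'task-specific component'. The result is a 'Bayesian deep learning' model for analysing all kinds of data.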
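The step from a single tree to a forest can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a real random forest: the data are simulated, each 'tree' is a one-split stump on a single feature, and the random feature subsetting that a proper random forest adds is left out, so strictly speaking this is bagging (bootstrap aggregation). All function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data WITH replacement."""
    return [data[rng.randrange(len(data))] for _ in range(len(data))]


def fit_stump(sample):
    """One-split classification tree: choose the threshold with fewest errors."""
    best = None
    for t in sorted({x for x, _ in sample}):
        left = [label for x, label in sample if x <= t]
        right = [label for x, label in sample if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, threshold, left_label, right_label = best
    return threshold, left_label, right_label


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_bagged_stumps(data, n_trees, rng):
    """Fit one stump per bootstrap sample."""
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]


def predict_bagged(stumps, x):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]


# Simulated two-class data (an assumption for illustration).
rng = random.Random(42)
data = ([(rng.gauss(0, 1), 0) for _ in range(100)]
        + [(rng.gauss(4, 1), 1) for _ in range(100)])

resample = bootstrap_sample(data, rng)
forest = fit_bagged_stumps(data, n_trees=25, rng=rng)
print("distinct points in one bootstrap sample:", len(set(resample)))
print("prediction at x = -3:", predict_bagged(forest, -3.0))
print("prediction at x = 7:", predict_bagged(forest, 7.0))
```

Because sampling is with replacement, each bootstrap sample repeats some observations and omits others, so every stump sees a slightly different data set; averaging their votes is what stabilises the final prediction.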

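Conclusion: in the future landscape of data analysis, the traditional paradigm of statistical hypothesis testing will probably struggle to keep a foothold.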

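It is time for those of us data analysts who were trained in traditional mathematical statistics to think carefully about our own future professional path.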

[1] Hao Wang & Dit-Yan Yeung (2021), "A Survey on Bayesian Deep Learning", https://arxiv.org/pdf/1604.01662.pdf


To repost this article, please contact the original author for permission, and state that it comes from 谢钢's ScienceNet blog.

Link: https://wap.sciencenet.cn/blog-3503579-1321469.html?mobile=1
