Peng Yong's personal blog http://blog.sciencenet.cn/u/bigdataage Focused only on Complex Systems Science & Data Science in Life Science.

Blog post

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

Viewed 6106 times | 2014-1-6 18:39 | Category: research notes | Tags: theory, scientific method, obsolete, the end, data deluge

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

Chris Anderson

This is an article from five or six years ago; the Chinese translation and the English original come from, respectively:

http://www.yeeyan.org/articles/view/sylviaangel/9995

http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

Some of the technical terms in the linked Chinese translation are rendered poorly; the English original is reproduced below.


Humanity may really need to start thinking seriously about what chaos theory implies for intelligence: once the data become plentiful enough, perhaps every model loses its force, and the world has to be understood in a new way.





The End of Theory: The Data Deluge Makes the Scientific Method Obsolete


"All models are wrong, but some are useful."So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don't have to settle for wrong models. Indeed, they don't have to settle for models at all.


Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.


The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.


At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.


Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.
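As a rough illustration of how far link statistics alone can go, the sketch below runs a PageRank-style power iteration over a small, invented link graph; it is a toy in plain Python, not Google's actual algorithm or data.

```python
# Minimal PageRank-style ranking from incoming-link statistics.
# The link graph and damping factor are hypothetical, for illustration only.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {                                        # hypothetical link graph
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))
```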


Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."


This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.


Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
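To see why the warning matters, the toy sketch below (synthetic data only) generates pairs of independent random walks and counts how often they appear strongly correlated by pure chance; the walk length and the 0.5 cutoff are arbitrary choices for illustration.

```python
# Independent random walks frequently look correlated: a purely synthetic
# illustration of why correlation alone proves nothing about causation.
import random

def random_walk(n, rng):
    x, walk = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        walk.append(x)
    return walk

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
trials, strong = 200, 0
for _ in range(trials):
    a = random_walk(300, rng)
    b = random_walk(300, rng)        # generated independently of a
    if abs(pearson(a, b)) > 0.5:
        strong += 1
print(f"{strong}/{trials} independent pairs show |r| > 0.5")
```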


But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.


Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility. In short, the more we learn about biology, the further we find ourselves from a model that can explain it.


There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
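One simple instance of "letting the algorithm find the pattern" is unsupervised clustering: no hypothesis about what generated the data, just structure recovered from the numbers themselves. The sketch below runs a bare-bones k-means on synthetic 2-D points; the data and the choice of k are invented for illustration.

```python
# Hypothesis-free pattern finding: k-means clustering on unlabeled points.
# Synthetic 2-D data; the cluster count k is an arbitrary illustrative choice.
import random

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest current center
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            groups[i].append(p)
        for i, g in enumerate(groups):
            if g:                                 # keep old center if a cluster emptied out
                centers[i] = (sum(p[0] for p in g) / len(g),
                              sum(p[1] for p in g) / len(g))
    return centers

rng = random.Random(1)
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in [(0, 0), (3, 3), (0, 3)] for _ in range(100)]
print(kmeans(points, k=3))
```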

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.


This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.
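The "statistical blip" idea can be caricatured with shared k-mers: a read that overlaps a known sequence is assigned to it, and a read that resembles nothing in the reference set is flagged as potentially new. The sketch below is a deliberately tiny stand-in, with made-up sequences, k, and threshold; it is not Venter's actual pipeline.

```python
# Toy k-mer similarity: flag a read as "novel" if it shares few k-mers with
# every reference sequence. Sequences, k, and the threshold are hypothetical;
# real metagenomic pipelines are far more involved.

def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

reference = {                          # hypothetical known sequences
    "species_A": "ATGGCGTACGTTAGCATGGCGTAC",
    "species_B": "TTGACCGGTAACCGGTTGACCGGT",
}

def classify(read, threshold=0.2):
    read_kmers = kmers(read)
    best = max(reference, key=lambda name: jaccard(read_kmers, kmers(reference[name])))
    score = jaccard(read_kmers, kmers(reference[best]))
    return (best, score) if score >= threshold else ("possible new lineage", score)

print(classify("ATGGCGTACGTTAGCATGGCGTAA"))   # close to species_A
print(classify("CCCCTTTTAAAAGGGGCCCCTTTT"))   # unlike anything in the reference set
```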


This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.
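For readers unfamiliar with MapReduce, the programming model at the heart of that software stack is simple: a map step emits key-value pairs and a reduce step aggregates them by key. The sketch below imitates the pattern in a single Python process using the canonical word-count example; it is not the distributed Google or IBM implementation.

```python
# Single-process imitation of the MapReduce pattern (map -> shuffle -> reduce).
# Word count is the canonical example; this is not a distributed implementation.
from collections import defaultdict

def map_phase(document):
    for word in document.lower().split():
        yield word, 1                  # emit (key, value) pairs

def shuffle(pairs):
    grouped = defaultdict(list)        # group values by key
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

documents = ["all models are wrong", "some models are useful"]
pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(shuffle(pairs)))
```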


Learning to use a "computer" of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.


There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?



See also:

竹筏还是灯塔——数据洪流中的科学方法 (Bamboo Raft or Lighthouse: The Scientific Method amid the Data Deluge)

http://songshuhui.net/archives/82772

(This is a commentary piece, and what it reviews is precisely the Wired article "The End of Theory". It rejects and criticizes Chris Anderson's views, and it is well written.)

At the end of that commentary there is a reader's comment that seems worth a look:

This is a very interesting topic, but if the author knew a little about statistical machine learning, the piece would probably have been written more even-handedly.

On the following points I have a few rough personal views of my own that I hope to discuss with everyone:
(1) "Results built purely on statistical correlation carry an unavoidable fuzziness."
Fuzziness is one of the basic attributes of human cognition. Not everything is fuzzy, but in an era when science is put into practice, adaptability cannot be guaranteed without some tolerance for fuzziness. A simple example: a human child who has seen only two or three kinds of cars can accurately recognize a new kind of car at first sight. No existing precise model can describe this "fuzzy" capacity for generalization, whereas statistics-based methods can, to some degree, emulate human observation, cognition, and learning (see Google's self-driving cars).

(2) "Can Google Translate serve as an exemplar of the scientific method of the future? The answer should be self-evident."
The machine-learning research community has a maxim: to solve a problem, one should not solve a more general problem as an intermediate step (Vapnik). Statistics-based methods are the best embodiment of this maxim. Take the author's own Google Translate example: large-scale statistics already produce better translations than any traditional method (otherwise why would Google not use the traditional methods, and Chinese translation has never been where the greatest effort is invested anyway). Even if a precisely specified language model could achieve the same translation quality, the effort spent would far exceed what the task of "translation" actually requires, which makes it the less reasonable route. Another example: when children first learn their mother tongue, they do not learn from grammar, word formation, or any other "language model" at all; they learn directly from everyday experience, simply associating language with meaning. It is the foreign language learned grammar-first that turns out to be harder to acquire.
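The commenter's Google Translate point, learning the mapping directly from parallel data rather than via an explicit model of the language, can be caricatured with bare co-occurrence counts. The sketch below uses a tiny invented English-French "corpus"; real statistical machine translation relies on far more sophisticated alignment models.

```python
# Toy word translation learned purely from co-occurrence statistics over a
# tiny invented parallel corpus: no grammar, no dictionary, no language model.
from collections import Counter, defaultdict

parallel = [                           # hypothetical English-French sentence pairs
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats",   "le chat mange"),
    ("the dog eats",   "le chien mange"),
]

cooc = defaultdict(Counter)            # cooc[english_word][foreign_word] = count
foreign_freq = Counter()
for en, fr in parallel:
    foreign_freq.update(fr.split())
    for e in en.split():
        for f in fr.split():
            cooc[e][f] += 1

def translate_word(word):
    # favour foreign words that co-occur with `word` more often than chance
    return max(cooc[word], key=lambda f: cooc[word][f] / foreign_freq[f])

print({w: translate_word(w) for w in ["cat", "dog", "sleeps", "eats"]})
```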

In short, statistics-based methods are so far the most effective means of coping with the information explosion, and they are themselves a scientific means. I hope that readers of this article will consult more material and follow the latest developments in academia and industry, rather than being misled by certain "traditional" views. A method is not scientific because it looks scientific; the method that actually solves the problem is the scientific one.

Finally, I believe that, because of the transformation of data itself, more of tomorrow's scientists will probably come from statistical backgrounds, but every scientific research method retains its own significance.

References:
1. The Fourth Paradigm: Data-Intensive Scientific Discovery
2. Statistical Learning Theory (Vapnik)



《第四范式:数据密集型科学发现》(The Fourth Paradigm: Data-Intensive Scientific Discovery, Chinese edition)

http://www.sciencep.com/qiyedongtai/wosheyaowen/2012-10-23/668.html

The Fourth Paradigm: Data-Intensive Scientific Discovery

http://research.microsoft.com/en-us/collaboration/fourthparadigm/


More Is Different

http://www.sciencemag.org/content/177/4047/393.extract

More really is different

http://www.sciencedirect.com/science/article/pii/S0167278909000852

“Networks” is different

http://www.frontiersin.org/Journal/10.3389/fgene.2012.00183/full






https://wap.sciencenet.cn/blog-830496-756498.html
