何毓琦的个人博客分享 http://blog.sciencenet.cn/u/何毓琦 哈佛(1961-2001) 清华(2001-date)


“老大哥在看着你呢 ” (模式识别在我们生活中无处不在)(中英对照) 精选

已有 12459 次阅读 2009-7-13 21:10 |系统分类:科普集锦

Big Brother is watching you (Pattern Recognition in our everyday lives)!
In this electronics-centric and globalized world, very little goes on in our lives that is not known to someone else or collected in some database. I wrote about this in an earlier blog article, “On Privacy and the Digital Revolution” (http://www.sciencenet.cn/m/user_content.aspx?id=35816). But many of us are still comforted by the thought that no one will take the trouble to analyze the zillions of data collected every day. So long as we live a normal life and are not engaged in crime or terrorism, no one will care. And even if we are, the chances of detection are small. But the science of pattern recognition makes this belief a somewhat chancy proposition.
So what is Pattern Recognition (PR)? And how does it work?
Intuitively and informally, we all have an idea of how PR works. Scientific discovery (e.g., Darwin’s theory of evolution and natural selection) happens when someone observes some phenomena, thinks about them, and derives a concise explanation of them. We form opinions about human behavior by seeing what people do. Military intelligence depends on piecing together seemingly disparate data and observations. The theory of PR is simply a codification of these intuitive and informal ideas into precisely defined rules and algorithms, so that the vast power of computers can be used in place of human effort. I shall briefly explain below, as popular science, what I know about this codification process. In the process, perhaps we will also gain a glimpse of how science and technology are increasingly taking over human endeavors.
Roughly speaking, there are four ingredients and steps in PR – feature selection, training and classification, generalization, and syntax analysis.
1.     Feature Selection – In principle, once you have digitized whatever you want to pattern-recognize, you can simply process the resultant bits. But this is highly inefficient or impossible. For example, digital data about a human face involves billions of bits. What you need to do first is to abstract from these raw bits some “features” that can be used to characterize the data, such as the color of the skin and the shape of the eyes and eyelids. These two features might be useful in recognizing whether a face belongs to an Asian or a non-Asian. How to select the correct and minimal set of features is an important problem of PR.
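To make the idea of abstracting features from raw data concrete, here is a minimal sketch in Python. It is only an illustration, not a method from any real face recognition system: the function name `extract_features`, the tiny grayscale "image", and the two features chosen (overall brightness and neighbour-to-neighbour contrast) are all assumptions made for this example.

```python
from statistics import mean

def extract_features(image):
    """Reduce a grayscale image (rows of 0-255 integers) to two numbers.

    Both features are hypothetical stand-ins for "color" and "shape".
    """
    pixels = [p for row in image for p in row]
    # Feature 1: overall brightness, scaled to the range [0, 1].
    brightness = mean(pixels) / 255.0
    # Feature 2: average absolute difference between horizontally
    # adjacent pixels, a crude measure of how much fine detail there is.
    diffs = [abs(a - b) for row in image for a, b in zip(row, row[1:])]
    contrast = mean(diffs) / 255.0
    return brightness, contrast

# A toy 2 x 4 "image"; a real picture would have millions of pixels.
image = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
]
features = extract_features(image)
print(features)
```

However large the image, the classifier downstream only has to deal with these two numbers, which is the whole point of feature selection.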
2.     Training for classification – Again in principle, any PR problem can be reduced to a series of yes-no classification problems (remember the “20 questions” game we often played as children?). Let us then consider the classification problem of differentiating between an Asian face and a non-Asian face. Assume we use two features, the color of the skin and the shape of the eyes (a simplification for explanation purposes only), to represent a face, and that we have a large collection of samples of Asian and non-Asian faces. Assume furthermore that these two features can be converted to numerical scales. We randomly choose half of this set of sample faces and plot the data in two dimensions. A typical situation, as in Fig. 1 (see below), may result, where squares denote samples of non-Asian faces and circles those of Asian faces.
To implement a classifier means to devise a two-dimensional curve that separates the squares from the circles. The simplest curve is a straight line, which can be characterized by two parameters. Using the samples, we can devise a successive approximation scheme (called training) that gradually adjusts the values of the parameters of the line until it separates the two groups of data with minimal errors, as shown in Fig. 2 (see below).
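One classic successive approximation scheme of this kind is the perceptron rule, sketched below in Python. The text does not say which scheme is meant, so treat this as one possible choice rather than the method behind Fig. 2. The sample points and labels are made up for illustration, and the line is stored as three weights (`w1`, `w2`, `b`), which describe the same two-parameter line up to an overall scale factor.

```python
def classify(weights, point):
    """Return +1 or -1 depending on which side of the line the point lies."""
    w1, w2, b = weights
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else -1

def train_perceptron(samples, labels, epochs=1000, lr=0.1):
    """Nudge the line a little every time it misclassifies a sample."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(samples, labels):
            if classify((w1, w2, b), (x1, x2)) != y:
                # Move the line toward putting this sample on the correct side.
                w1 += lr * y * x1
                w2 += lr * y * x2
                b += lr * y
                errors += 1
        if errors == 0:  # a full pass with no mistakes: training is done
            break
    return w1, w2, b

# Hypothetical two-feature samples: -1 plays the role of the "squares",
# +1 the role of the "circles".
samples = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (4.0, 4.5), (5.0, 4.0), (4.5, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

weights = train_perceptron(samples, labels)
train_errors = sum(classify(weights, p) != y for p, y in zip(samples, labels))
print(weights, train_errors)
```

Because these toy samples can be separated by a straight line, the rule is guaranteed to stop making mistakes after a finite number of corrections. On overlapping data such as a realistic Fig. 1 it would never settle completely, and one would stop after a fixed number of passes and accept some errors.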
 
 
Of course, at this point astute readers may ask why we do not use a more complex curve so that an even better separation can result, as in Fig. 3 (see below).
Good question! The answer lies in the issue of
3.     Generalization – The quality or goodness of a classifier must be tested on new samples not previously used for training. Thus, once we have designed a classifier (i.e., in our case, fixed the values of the two parameters characterizing the separating line in Fig. 2), we need to test it on the remaining half of the sample data to see if it performs as well as it did on the training data, i.e., whether it can “generalize” its classifying property to new data. This is a well-known problem in the statistics literature: how well does the sample property predict the true property? For example, if the sample used to calculate a property is small, then the predictive quality of the property thus calculated will be poor, i.e., it cannot generalize. Thus, when we use a more complex curve to classify the samples, the number of parameters characterizing the curve must be “small” relative to the sample size. If we have to add three more parameters to the curve in order to eliminate classification errors on, say, two samples, this is probably not a good idea. But if we can eliminate three hundred errors, then we should do it.
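The half-for-training, half-for-testing procedure can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the data are synthetic overlapping clusters, the "simple" model is a one-parameter threshold on the sum of the two features, and a nearest-neighbour memorizer stands in for the "more complex curve" (it effectively has as many parameters as there are training samples).

```python
import random

def make_data(n, seed=1):
    """Synthetic two-feature samples from two overlapping clusters."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        centre = 1.0 if label == 1 else -1.0
        point = (rng.gauss(centre, 1.5), rng.gauss(centre, 1.5))
        data.append((point, label))
    return data

def split_half(data, seed=0):
    """Randomly divide the samples into a training half and a test half."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def error_rate(classify, data):
    return sum(classify(p) != label for p, label in data) / len(data)

def fit_threshold(train):
    """Simple model: one parameter t; predict +1 when x1 + x2 > t."""
    def rule(t):
        return lambda p: 1 if p[0] + p[1] > t else -1
    candidates = sorted(p[0] + p[1] for p, _ in train)
    best = min(candidates, key=lambda t: error_rate(rule(t), train))
    return rule(best)

def fit_nearest_neighbour(train):
    """Complex model: memorize every sample, copy the label of the closest."""
    def classify(p):
        nearest = min(train, key=lambda s: (s[0][0] - p[0]) ** 2 + (s[0][1] - p[1]) ** 2)
        return nearest[1]
    return classify

train, test = split_half(make_data(200))
simple_model = fit_threshold(train)
complex_model = fit_nearest_neighbour(train)

for name, model in [("simple", simple_model), ("complex", complex_model)]:
    print(name, "train error:", error_rate(model, train),
          "test error:", error_rate(model, test))
```

The memorizer makes no errors on the data it was trained on, which says nothing about its quality. The numbers that matter are the test errors on the half it has never seen, and comparing the two columns is exactly the generalization check described above.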
4.     Syntax Analysis – The phrase “the context makes it clear” reflects the fact that the meaning of a word often depends on how it is used, e.g., on the neighboring words in a sentence. Thus, in classifying “faces” we may want to look at the surrounding scene in which we find the “face”. If we find many Asian “features” such as Chinese words and signs, then the probability of classifying the face as Asian should probably be increased. Put another way, we can escalate PR to a higher level and practice the art of “feature selection-training-generalization” on an increasing amount of data. This is of course a never-ending problem, which is why general language translation and artificial intelligence remain unsolved problems. However, in the more limited and prescribed situation of PR for a particular purpose, much can be and has been done.
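One common way to let context raise or lower a classification probability is Bayes' rule, sketched below in Python. This is an illustrative choice and not necessarily what the text has in mind. The function name `update_with_context` and all the numbers are hypothetical.

```python
def update_with_context(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule: revise the probability of a class after seeing a context clue.

    prior                -- probability of the class before looking at the context
    likelihood_if_true   -- chance of seeing the clue when the class is correct
    likelihood_if_false  -- chance of seeing the clue when the class is wrong
    """
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Hypothetical numbers: the classifier alone is undecided (0.5), and the clue
# (say, Chinese signs in the background) is four times as likely to appear
# when the classification is correct as when it is not.
posterior = update_with_context(prior=0.5, likelihood_if_true=0.8, likelihood_if_false=0.2)
print(posterior)
```

A clue that is equally likely either way leaves the probability unchanged, so only context that actually discriminates between the classes is worth collecting.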
 
Examples of PR being used in our daily lives are numerous in the US. From our supermarket shopping, companies know a great deal about the foods we eat and can target advertisements to individual families. So does Amazon.com with the books and other things you purchase. Magazines can print individual ads in the copy each of us receives, since they know our likes and dislikes from similar databases. I have no doubt that all e-mail traffic between China and the US is monitored by both sides for intelligence purposes. Moreover, with ever-increasing computing power and ever more collection of electronic data, who is to say that spare computing capacity won’t be put to use to discover whatever patterns Big Brother wants to know? Your Google searches and cell phone records are all available to the government. There is no privacy anymore.
Note added 7/15/09: An item in today's Campus Technology newsletter reported the following example of PR:
Carnegie Mellon Researchers Find Social Security Numbers Can Be Predicted
Carnegie Mellon University researchers have shown that public information gleaned from governmental sources, commercial databases, and online social networks can be used to routinely predict most--and sometimes all--of an individual's nine-digit Social Security number.
http://www.1105newsletters.com/t.do?id=2951907:780897



“老大哥在看着你呢” (模式识别在我们生活中无处不在)

在如今这个电子技术高度发达和国际化的世界中, 我们日常生活中的经历也往往逃不过别人或者某些数据库的眼睛。我早些日子写过一篇关于“隐私和数字革命”的博文(http://www.sciencenet.cn/m/user_content.aspx?id=35816)。但是,我们中的很多人仍然会心中暗自松一口气,因为他们以为没有人会不怕麻烦地来分析日积月累下来的海量数据。到现在为止只要我们还过着“比较正常” 的生活, 不从事犯罪或恐怖活动, 一般没有人会担心(自己的行踪被记录在案)。 即使担心, 也觉得自己的资料被从茫茫数据中发掘出来的机会很小。但是随着模式识别技术的发展,这种想法的风险越来越大了。

 
那么到底什么是模式识别(Pattern Recognition)呢?它又是如何运作的呢?


我们其实凭着直觉大都多少知道一些模式识别的运作方式。 科学发现(例如达尔文的生物进化论和自然选择论) 源于人类对某种现象的观察、思考并从中提炼出的精要解释。我们通过观察别人的行为形成自己的观点。把看似无关的数据和观测结果拼接起来就得到了军事情报。 模式识别的原理简单说来就是把这些靠直觉得到的非正式的想法汇编成精确定义的法则和算法, 这样我们就可以利用电脑的强大计算能力来取代人工。在下面几段里,我将简要解释一下我对这一汇编过程的理解。同时, 也希望能帮助大家初步了解到科学技术是如何越来越多地减轻人类劳动的。

简单说来,模式识别主要包括四个方面和步骤:特征选择、训练和分类、泛化以及语法分析。


1. 特征选择(Feature Selection)。理论上说, 一旦我们将想要模式识别的对象数字化,我们只需处理得到的比特就行了。但是这个过程极其低效甚至是不可行的。 比如说, 描述一张人脸的数字数据高达数十亿比特。我们首先需要从这些原始数据中提炼出某些“特征”(Feature), 例如皮肤的颜色,眼睛眼睑的轮廓, 用这些特征来分类定义原始数据。 这两个特征有可能帮助你识别这张脸是否属于亚洲人。 如何正确地选择数量最少的特征就成为了模式识别中非常重要的一个问题。


2. 分类训练(Training for Classification)。同样的,理论上说,任何模式识别的问题都可以被简化为一系列问题,问题的答案非此即彼。(记得我们小时候常玩的“20个问题”的游戏吗?) 我们再回到如何区分亚洲人脸和非亚洲人脸的问题上来。为了简化问题,假设我们只使用两个特征来描述一张脸:皮肤颜色和眼部轮廓。我们有很多亚洲人脸和非亚洲人脸的数据。再假设以上两个特征可以用数字来量化。我们随机抽取其中一半的数据, 把这些数据绘制在二维坐标系中。一个典型的情况, 如图一所示, 用方形点表示非亚洲人的数据,圆形点表示亚洲人的数据。


如果我们想要归纳出一个分类标准,就意味着要在图示的二维坐标系中描绘出一条曲线将方形点和圆形点分开。 这类曲线中最简单的一种是使用两个参数的直线。 使用这些数据我们可以得到一个连续的拟合方案(叫做 “训练” Training),使用该拟合方案,我们可以逐步校正描述曲线的参数的值,直到该曲线能以最少的错误率将两组数据分开(见图二)。


当然这时候敏锐的读者可能要问了,干嘛我们不用更加复杂的曲线以取得更准确的分类?这个问题问得好。 答案就在下面:


3. 泛化(Generalization)。 判断一个分类模型的好坏,必须使用以前未曾被“训练”过的新数据来测试。 所以,一旦我们做出了一个分类模型(比如, 在这篇文章中,当用一半的数据得到了如图二所示的直线中两个参数的值后),我们就需要验证该分类模型的泛化能力,即该模型是否能够同样作用于另外一半数据。统计学里的一个著名问题就是:样本的特性预测真实特性的能力有多强。 比如说,如果样本很小而导致预测出的特性与真实情况偏差较大,就说明这个模型无法泛化。因此,如果我们使用更复杂的曲线来给样本数据归类,参数的数量相对于样本的体量来说必须保持较少的规模。所以如果增加三个参数只能消除两个样本数据的错误,就不是一件划算的事情。 但是如果增加这三个参数能消除三百个错误数据,我们就觉得很划算。


4. 语法分析(Syntax Analysis)。人们常说“语境决定语义”,意思是词语的意思很多时候取决于上下文,比方说, 同一个句子中临近的其他词。因此, 为了把关于脸的数据进行合理分类,我们可能应该考察一下脸出现的那些场合。 比如,如果我们发现了很多亚洲特征,如汉字和中文符号, 那么把这个数据分类为亚洲人的可能性就应当增加。换而言之, 我们可以将模式识别提升到更高的层次, 在更多的数据上运用“特征选择-分类训练-泛化”的方法。 这当然是一个永远无法完全解决的问题, 这也是为什么语言翻译和人工智能问题仍然未被攻克。 但是, 如果为了特定的目的进行模式识别,而且情况是被限定的、有限制的,那么这是可以做到的,而且已经做到了。


模式识别已经在美国的日常生活中得到广泛应用: 在超市购物时,商家知道消费者喜爱什么食品,所以能够单独针对个别家庭做广告。 亚马逊网(amazon.com) 在销售书籍和物品时也是使用了模式识别。 杂志社能够通过分析相关数据库得出每位读者的喜好,刊登为这些读者量身定制的广告。我毫不怀疑中美之间的电子邮件会被两国政府监控,用以搜集情报。 但是,随着电脑的计算能力日益强大,越来越多数据被收集,谁又能知道闲置的计算容量会不会被用来识别“老大哥”想要知道的任何事情呢?政府有我们的Google搜索记录和手机记录。 所谓的隐私实际上已不复存在了。(科学网 李璐璐译 何姣校)

2009年7月15日加注:今天的Campus Technology新闻简报刊登了一条以下模式识别的例子:

卡内基梅隆大学的研究者发现社会安全号码(Social Security Numbers)是可以被预测的
卡内基梅隆大学的研究者们发现从各类政府资源、商业数据库以及在线社会网络搜集来的公共信息可以被用于预测大多数——有的时候甚至是全部九位数字的社会安全号码。http://www.1105newsletters.com/t.do?id=2951907:780897



https://wap.sciencenet.cn/blog-1565-243352.html
