老码农分享 (An Old Programmer's Notes) http://blog.sciencenet.cn/u/seawan // coding, reading, just passing through;


A Few Parameters of TermDocumentMatrix

Posted 2012-6-17 20:29 | Personal category: tm | System category: Research notes | Chinese

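Code snippets found online for processing word-segmented Chinese text
often build the corpus with nothing more than:

c <- Corpus(VectorSource(re))

and then call
TermDocumentMatrix(c)
to compute the term-document matrix.
One example is the blog post 中文文本挖掘小例子及程序 ("A small example and program for Chinese text mining").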


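However, because of the particular characteristics of Chinese text, setting one or more of the following parameters helps avoid analysis errors. For instance, many Chinese words are only one or two characters long, so the default minimum word length of 3 can silently discard them.
(See termFreq {tm}.)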

removePunctuation

A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.
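
As a rough illustration for segmented Chinese text, the sketch below passes the list form of this option. The sample sentence is made up. VCorpus is used instead of Corpus as an assumption, so that the control options are handled by termFreq. The `ucp = TRUE` argument assumes a tm version whose removePunctuation() supports it; it asks for Unicode punctuation, which should cover full-width marks such as , and 。. wordLengths is widened so that short words are not dropped for an unrelated reason.

```r
library(tm)

# Made-up, already segmented sentence; the punctuation marks are separate tokens.
seg_punct <- "今天 天气 很 好 , 我们 去 公园 。"
corp_punct <- VCorpus(VectorSource(seg_punct))  # VCorpus is an assumption

tdm_punct <- TermDocumentMatrix(
  corp_punct,
  control = list(
    wordLengths = c(1, Inf),               # keep one- and two-character words
    removePunctuation = list(ucp = TRUE)   # assumes removePunctuation() has 'ucp'
  )
)

# The tokens "," and "。" should no longer be listed among the terms.
Terms(tdm_punct)
```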

removeNumbers

A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.
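
A small sketch of this option on segmented Chinese text, assuming a made-up sentence in which the numbers are separate tokens. VCorpus (rather than Corpus) and the widened wordLengths are choices made for this illustration, not something the original post prescribes.

```r
library(tm)

# Made-up segmented sentence: "2012" and "17" are stand-alone tokens.
seg_num <- "2012 年 发表 17 篇 博文"
corp_num <- VCorpus(VectorSource(seg_num))  # VCorpus is an assumption

tdm_num <- TermDocumentMatrix(
  corp_num,
  control = list(
    wordLengths = c(1, Inf),   # keep single-character words such as "年"
    removeNumbers = TRUE       # strip digits before counting terms
  )
)

# The purely numeric tokens should be gone; the Chinese words remain.
Terms(tdm_num)
```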

stopwords

Either a Boolean value indicating stopword removal using default language specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.
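
The bundled stopword lists may not cover Chinese, so for segmented Chinese text the character-vector form is probably the practical one. The sketch below uses a tiny hypothetical stopword list (`my_stopwords`) and made-up sentences; VCorpus is an assumption of the example.

```r
library(tm)

# Made-up segmented sentences containing the function words "的" and "了".
seg_stop <- c("我 的 文本 的 分析", "他 的 分析 了 文本")
corp_stop <- VCorpus(VectorSource(seg_stop))  # VCorpus is an assumption

# Hypothetical, deliberately tiny stopword list for illustration only.
my_stopwords <- c("的", "了")

tdm_stop <- TermDocumentMatrix(
  corp_stop,
  control = list(
    wordLengths = c(1, Inf),    # otherwise all of these short words would be at risk
    stopwords = my_stopwords    # custom character vector of stopwords
  )
)

# "的" and "了" should not appear among the terms.
Terms(tdm_stop)
```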


bounds

A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1,Inf)) (i.e., every token will be used).
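
To make the per-document nature of bounds$local concrete, here is a minimal sketch with made-up segmented text in which only terms occurring at least twice within a document are kept for that document. The corpus contents and the use of VCorpus are assumptions of the example.

```r
library(tm)

# Made-up segmented documents with repeated words.
seg_bounds <- c("文本 文本 挖掘 分析", "挖掘 挖掘 挖掘 文本")
corp_bounds <- VCorpus(VectorSource(seg_bounds))  # VCorpus is an assumption

tdm_bounds <- TermDocumentMatrix(
  corp_bounds,
  control = list(
    wordLengths = c(1, Inf),
    bounds = list(local = c(2, Inf))  # keep a term in a document only if it occurs >= 2 times there
  )
)

# Rows are terms, columns are documents. A term that falls below the bound
# in one document contributes 0 for that document.
m_bounds <- as.matrix(tdm_bounds)
m_bounds
```

Note that the bound is applied within each document separately, so "文本" survives in the first document but not in the second.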

wordLengths

An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
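
This default is the one most likely to matter for Chinese. The sketch below compares the default with wordLengths = c(1, Inf) on made-up segmented text. It assumes a UTF-8 session and uses VCorpus, so that word length is measured by termFreq; in recent tm versions Corpus(VectorSource(...)) may return a different corpus class that is processed by another code path, so results there could differ.

```r
library(tm)

# Made-up segmented documents; most tokens have one or two characters,
# only "语料库" has three.
seg_len <- c("我 喜欢 文本 挖掘", "文本 挖掘 需要 语料库")
corp_len <- VCorpus(VectorSource(seg_len))  # VCorpus is an assumption

# Default control: minimum word length of 3.
tdm_default <- TermDocumentMatrix(corp_len)

# Minimum word length lowered to 1, so short words are kept.
tdm_all <- TermDocumentMatrix(corp_len, control = list(wordLengths = c(1, Inf)))

# Compare how many terms, and which ones, survive in each case.
nTerms(tdm_default)
nTerms(tdm_all)
Terms(tdm_all)
```

With the default, the one- and two-character tokens would be expected to disappear, which for real segmented Chinese text usually means losing most of the vocabulary.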


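An example:
library(tm)
# ovid is assumed to be an existing corpus
tdm <- TermDocumentMatrix(ovid,
    control = list(wordLengths = c(1, Inf),   # keep one- and two-character words
                   removePunctuation = TRUE,
                   removeNumbers = TRUE,
                   stopwords = FALSE))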
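
The name ovid matches the sample corpus used in the tm documentation, built from text files shipped with the package. Assuming that is what was meant, a self-contained version of the call might look like the sketch below; the `language = "lat"` setting and the directory path follow the tm documentation example rather than anything stated in this post.

```r
library(tm)

# Sample texts shipped with tm (assumed to be the source of 'ovid').
txt_dir <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt_dir, encoding = "UTF-8"),
                readerControl = list(language = "lat"))

tdm_ovid <- TermDocumentMatrix(
  ovid,
  control = list(
    wordLengths = c(1, Inf),
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    stopwords = FALSE
  )
)

# Number of terms (rows) and documents (columns).
dim(tdm_ovid)
```

For segmented Chinese text, the same control list would be passed with a corpus built from the segmentation result instead of ovid.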



https://wap.sciencenet.cn/blog-461456-583133.html
