Creating features from a list of variables in R

Time: 2014-02-26 14:28:13

Tags: r text feature-extraction

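I have a list of words from the Yelp Academic Dataset, and I am trying to create a list of features for a model from it. I would like a dummy variable indicating the presence or absence of each word in this list.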

Example:

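The review "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream" would, for example, first have frequent words and stop words filtered out. Suppose this leaves oldish sweet ice cream. I would like R to automatically generate a new dummy variable for each of hasOldish, hasSweet, hasIce and hasCream.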

Is there a way to do this?

1 answer:

Answer 0: (score: 1)

As @Thomas commented, you should try something yourself, or at least show what you have tried. I use the tm package here.

txt <- "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream "

library(tm)
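## create a corpus
dd <- Corpus(VectorSource(txt))
## tokenizer: split the text on single spaces
scanner <- function(x) unlist(strsplit(x, " "))
## define controls:
## the tokenizer above, stemming (needs the SnowballC package installed),
## and a dictionary, since you are looking only for specific words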
ctrl <- list(tokenize = scanner,
             stemming = TRUE,
             dictionary=c('oldish','sweet','ice','cream'))
termFreq(dd[[1]], control = ctrl)

oldish  sweet    ice  cream 
     1      1      1      1 
attr(,"class")
[1] "term_frequency" "integer"
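
The counts above cover a single document. To get the hasOldish, hasSweet, ... dummy columns the question asks for, across several reviews, something like the following may work. This is only a minimal sketch in base R, not the tm approach used above: the second review is invented for illustration, the names `reviews`, `words`, `tokens`, `has_word` and `features` are hypothetical, and it matches whole words only, with no stemming or stop-word removal.

```r
# Example reviews; the second one is made up for illustration.
reviews <- c(
  "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream",
  "Great coffee but the ice machine was broken"
)

# Words of interest, as in the question.
words <- c("oldish", "sweet", "ice", "cream")

# Lower-case each review and split on anything that is not a letter.
tokens <- strsplit(tolower(reviews), "[^a-z]+")

# One row per review, one logical column per word: TRUE if the word occurs.
has_word <- t(vapply(tokens, function(tok) words %in% tok,
                     logical(length(words))))

# Name the columns hasOldish, hasSweet, ... and convert TRUE/FALSE to 1/0.
colnames(has_word) <- paste0("has", toupper(substring(words, 1, 1)),
                             substring(words, 2))
features <- as.data.frame(has_word + 0L)
features
```

Here the first row of `features` is all ones, while the second row has only hasIce set. If stemming matters (so that "sweeter" counts as "sweet"), the tokens would need to be stemmed before the `%in%` check.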