Question

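I have a list of words from the Yelp Academic Dataset, and I am trying to create a list of features for a model from it. I want a dummy variable indicating the presence or absence of each word in this list.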

Example:

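The review "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream" would first be filtered for frequent words and stop words. Let's say that this leaves oldish, sweet, ice, and cream. I want R to automatically generate a new dummy variable for each of hasOldish, hasSweet, hasIce, and hasCream.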

Is there a way to do this?

Answer 1

As @Thomas commented, you should try something yourself, or at least show what you have tried. Here is one approach using the tm package.

txt <- "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream "

library(tm)
## create a corpus
dd = Corpus(VectorSource(txt))
scanner <- function(x) unlist(strsplit(x," "))
## define controls:
## a scanner to split the text into words,
## and a dictionary, since you are looking only for specific words
ctrl <- list(tokenize = scanner,
             stemming = TRUE,
             dictionary=c('oldish','sweet','ice','cream'))
termFreq(dd[[1]], control = ctrl)

oldish  sweet    ice  cream 
     1      1      1      1 
attr(,"class")
[1] "term_frequency" "integer"
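
The output above is a count per dictionary word for a single document, not yet the hasOldish-style dummy columns the question asks for. As a rough sketch of that last step, here is one way to build them in base R without tm. The `reviews` vector (including its second, made-up review), the `words` list, and the `features` name are assumptions for illustration only, and matching here is on exact lower-cased tokens with no stemming.

```r
# Hypothetical input: two reviews (the second is invented for illustration)
# and the word list that survives filtering.
reviews <- c(
  "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream",
  "Great ice cream, slow service"
)
words <- c("oldish", "sweet", "ice", "cream")

# Lower-case each review and split it on runs of non-letter characters.
tokens <- strsplit(tolower(reviews), "[^a-z]+")

# One row per review, one column per word: 1 if the word occurs, 0 otherwise.
dummies <- t(vapply(tokens,
                    function(tok) as.integer(words %in% tok),
                    integer(length(words))))

# Build the column names hasOldish, hasSweet, ... from the word list.
colnames(dummies) <- paste0("has",
                            toupper(substring(words, 1, 1)),
                            substring(words, 2))

features <- as.data.frame(dummies)
print(features)
```

If you want stemmed matching, as with `stemming = TRUE` in the tm call, you would need to stem the tokens before the `%in%` test; this sketch leaves that out.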
