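I have a list of words from the Yelp Academic Dataset, and I am trying to build a feature list for a model from it. I want one dummy variable per word in the list, indicating whether that word is present or absent.
Example: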
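The review "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream", for example, would first be filtered and have its stop words removed. Suppose this leaves oldish, sweet, ice and cream. I would like R to automatically generate a new dummy variable for each of them: hasOldish, hasSweet, hasIce and hasCream.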
Is there a way to do this?
Answer 0: (score: 1)
As @Thomas commented, you should try something yourself, or at least show what you have tried. Here I use the tm package.
txt <- "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream "
library(tm)
## create a corpus
dd = Corpus(VectorSource(txt))
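## as.character() guards against x arriving as a document object rather than a plain string
scanner <- function(x) unlist(strsplit(as.character(x), " "))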
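## define controls:
##   tokenize   - scanner to split the text into words
##   stemming   - reduce words to their stems (relies on the SnowballC package)
##   dictionary - restrict counting to the words you are looking for
ctrl <- list(tokenize = scanner,
             stemming = TRUE,
             dictionary = c('oldish', 'sweet', 'ice', 'cream'))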
termFreq(dd[[1]], control = ctrl)
## oldish  sweet    ice  cream
##      1      1      1      1
## attr(,"class")
## [1] "term_frequency" "integer"
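
The counts above are per-word frequencies, while the question asks for 0/1 variables named hasOldish, hasSweet and so on. The following is a minimal base R sketch of that last step, not part of the original answer. It does not use tm or stemming; the second review, the `tokenize` helper and the `dummies` data frame are hypothetical names introduced only for illustration.

```r
# Hypothetical input: the review from the question plus one made-up review
reviews <- c(
  "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream",
  "Great ice cream, grumpy staff"
)
dict <- c("oldish", "sweet", "ice", "cream")

# Lower-case the text, replace punctuation with spaces, split on whitespace
tokenize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))
  tokens <- unlist(strsplit(x, "[[:space:]]+"))
  tokens[nzchar(tokens)]
}

# One row per review, one 0/1 entry per dictionary word (exact match only)
rows <- lapply(reviews, function(r) as.integer(dict %in% tokenize(r)))
dummies <- as.data.frame(do.call(rbind, rows))

# Name the columns hasOldish, hasSweet, hasIce, hasCream
names(dummies) <- paste0("has", toupper(substring(dict, 1, 1)), substring(dict, 2))

print(dummies)
```

With these two reviews, the first row is all ones and the second row has ones only for hasIce and hasCream. Because matching is exact, "sweeter" does not count as "sweet"; if you want stemmed matches, you could instead take the `termFreq` result from above and turn it into indicators with something like `as.integer(counts > 0)`, where `counts` is whatever name you store that result under.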