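I want to create a TDM (term-document matrix) from text using specific phrases (combinations of two or more words) rather than single words. The phrases could be, for example, "climate change", "global worming", "land use", and so on. All the examples I have seen use single words.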
tabela = DocumentTermMatrix(textolimpo,
                            list(dictionary = c("climate change", "global worming", "land use")))
I would be very grateful if anyone could help.
Cheers,
Rafael
Answer 0 (score: 2)
I would suggest quanteda:
library(quanteda)
textolimpo <- c("This climate change concerns me. This climate changes",
                "Wormed: global worming increased")
(dfm <- dfm(textolimpo,
            ngrams = 2L,
            dictionary = list(climate = "climate_change",
                              warm = "global_worming"),
            valuetype = "regex"))
# 2 x 2 sparse Matrix of class "dfmSparse"
#        features
# docs    climate warm
#   text1       2    0
#   text2       0    1
(dfm <- dfm(textolimpo,
            ngrams = 2L,
            thesaurus = list(climate = "climate_change",
                             warm = "global_worming"),
            valuetype = "regex"))
# 2 x 8 sparse Matrix of class "dfmSparse"
#        features
# docs    this_climate change_concerns concerns_me me_this wormed_global worming_increased CLIMATE WARM
#   text1            2               1           1       1             0                 0       2    0
#   text2            0               0           0       0             1                 1       0    1
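
The calls above pass ngrams, dictionary and thesaurus directly to dfm(), which newer quanteda releases may no longer accept. A rough equivalent, assuming quanteda 3 or later, is to tokenize first and then apply tokens_lookup() with a dictionary whose values are multi-word patterns. The names dict, toks, dfm_dict and dfm_thes are only illustrative, and the glob pattern "climate change*" is my stand-in for the regex used above so that "climate changes" is still counted.

```r
library(quanteda)

textolimpo <- c("This climate change concerns me. This climate changes",
                "Wormed: global worming increased")

# Multi-word values are matched as sequences of tokens.
# Assumption: glob matching (the default) replaces the regex above.
dict <- dictionary(list(climate = "climate change*",
                        warm = "global worming"))

# Tokenize, drop punctuation, and lowercase before the lookup
toks <- tokens_tolower(tokens(textolimpo, remove_punct = TRUE))

# Dictionary-style: keep only the matched keys
dfm_dict <- dfm(tokens_lookup(toks, dictionary = dict))

# Thesaurus-style: keep unmatched tokens and replace matched
# phrases with the upper-cased key
toks_thes <- tokens_lookup(toks, dictionary = dict,
                           exclusive = FALSE, capkeys = TRUE)
dfm_thes <- dfm(toks_thes, tolower = FALSE)

print(dfm_dict)
print(dfm_thes)
```

Unlike the original thesaurus example, this sketch keeps the unmatched text as single words rather than bigrams; tokens_ngrams() could be applied to the tokens if bigrams are wanted.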