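Hi, I have a list of words that were stemmed using the "tm" package in R. Is there a way to recover the root words after this step? Thanks in advance.
Ex: activiti -> activity
Answer 0: (score: 1)
You can use the stemCompletion() function for this, but you may need to trim the stems first. Consider the following: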
library(tm)
library(qdap) # provides the stemmer() function
active.text = "there are plenty of funny activities"
active.corp = Corpus(VectorSource(active.text))
(st.text = tolower(stemmer(active.text, warn = FALSE)))
# this is what the columns of your Term Document Matrix are going to look like
# [1] "there" "are" "plenti" "of" "funni" "activ"
st.text = gsub("[aeyuio]+$","",st.text) # remove trailing vowels from each stem
stemCompletion(st.text,active.corp,"prevalent") # now it works
#    ther       ar    plent       of     funn        activ
# "there"    "are" "plenty"     "of"  "funny" "activities"
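Be aware, though, that stemming conflates some words. For example, "university" and "universal" both become "univers" after stemming, and there is nothing you can do to recover the original word reliably.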
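To see the collision for yourself, here is a minimal sketch. It assumes the SnowballC package is installed (it provides the Porter stemmer that tm's stemDocument() relies on); the variable names are just for illustration.

```r
library(SnowballC)  # assumption: SnowballC is installed

# two different words that share a stem
words <- c("university", "universal")
stems <- wordStem(words, language = "english")

# a named vector makes the word-to-stem mapping easy to read
print(setNames(stems, words))

# TRUE when more than one word maps to the same stem,
# i.e. the stem alone cannot tell you which word it came from
ambiguous <- any(duplicated(stems))
print(ambiguous)
```

Once two words share a stem, stemCompletion() can only pick one completion according to its type argument (for example the most prevalent one in the dictionary), so the other word is lost.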
Hope this helps.
Answer 1: (score: 0)
Check out stemCompletion from the tm package:
library(tm)
v <- "There are plenty of activities."
stemCompletion("activiti", scan_tokenizer(tolower(v)))
# activiti
# "activities"