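I am pre-processing my data to run an LDA model. Is there a better way to ignore plurals than using "stem = TRUE" (for example "rates", "rate", "country", "countries")? I don't want to stem every word, only certain specific words that often appear in both plural and singular form.
Any tips?
I tried using "stem = TRUE", and I also created a dictionary and used "dictionary = dict" in the dfm code, but apparently that only picks up the words that are in the dictionary.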
Answer 0 (score: 0)
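The best way to do this is to use a tool that tags your plural nouns, and then convert those to singular. Unlike the stemmer solution, this will not reduce other words, for example stemming to stem, or quickly to quick.
I'd recommend the spacyr package for this, which integrates nicely with quanteda. Here's an example: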
library("quanteda")
## Package version: 1.4.3
library("spacyr")
txt <- c(
  "Plurals in English can include irregular words such as stimuli.",
  "One mouse, two mice, one house, two houses."
)
txt_parsed <- spacy_parse(txt, tag = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.1.3, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
txt_parsed
##    doc_id sentence_id token_id     token     lemma   pos tag     entity
##  1  text1           1        1   Plurals    plural  NOUN NNS
##  2  text1           1        2        in        in   ADP  IN
##  3  text1           1        3   English   English PROPN NNP LANGUAGE_B
##  4  text1           1        4       can       can  VERB  MD
##  5  text1           1        5   include   include  VERB  VB
##  6  text1           1        6 irregular irregular   ADJ  JJ
##  7  text1           1        7     words      word  NOUN NNS
##  8  text1           1        8      such      such   ADJ  JJ
##  9  text1           1        9        as        as   ADP  IN
## 10  text1           1       10   stimuli  stimulus  NOUN NNS
## 11  text1           1       11         .         . PUNCT   .
## 12  text2           1        1       One       one   NUM  CD CARDINAL_B
## 13  text2           1        2     mouse     mouse  NOUN  NN
## 14  text2           1        3         ,         , PUNCT   ,
## 15  text2           1        4       two       two   NUM  CD CARDINAL_B
## 16  text2           1        5      mice     mouse  NOUN NNS
## 17  text2           1        6         ,         , PUNCT   ,
## 18  text2           1        7       one       one   NUM  CD CARDINAL_B
## 19  text2           1        8     house     house  NOUN  NN
## 20  text2           1        9         ,         , PUNCT   ,
## 21  text2           1       10       two       two   NUM  CD CARDINAL_B
## 22  text2           1       11    houses     house  NOUN NNS
## 23  text2           1       12         .         . PUNCT   .
# replace token with lemma for plural nouns
txt_parsed$token <- ifelse(txt_parsed$tag == "NNS",
  txt_parsed$lemma,
  txt_parsed$token
)
(Of course, there are many ways to perform this conditional replacement, including dplyr.)
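For anyone who prefers the dplyr route, here is a minimal sketch of the same conditional replacement. To keep it runnable without spaCy, it uses a small hand-built data frame called `parsed` (a made-up stand-in for a few rows of `spacy_parse()` output, not part of the example above); with real data you would apply the same `mutate()` to `txt_parsed`.

```r
library(dplyr)

# Hand-built stand-in for a few rows of spacy_parse() output (illustrative only)
parsed <- data.frame(
  token = c("two", "mice", "one", "house", "stimuli"),
  lemma = c("two", "mouse", "one", "house", "stimulus"),
  tag   = c("CD", "NNS", "CD", "NN", "NNS"),
  stringsAsFactors = FALSE
)

# Swap in the lemma only where the tag marks a plural noun (NNS);
# every other token is left exactly as it was
parsed <- parsed %>%
  mutate(token = if_else(tag == "NNS", lemma, token))

parsed$token
```

`if_else()` is stricter than base `ifelse()` about both branches having the same type, which is harmless here since `token` and `lemma` are both character columns.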
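Now the words that were plural nouns have been replaced by their singular variants, including irregular ones such as stimuli and mice, which no stemmer would be smart enough to figure out.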
dfmat <- dfm(as.tokens(txt_parsed), remove_punct = TRUE)
dfmat
## Document-feature matrix of: 2 documents, 14 features (50.0% sparse).
## 2 x 14 sparse Matrix of class "dfm"
##        features
## docs    plural in english can include irregular word such as stimulus one
##   text1      1  1       1   1       1         1    1    1  1        1   0
##   text2      0  0       0   0       0         0    0    0  0        0   2
##        features
## docs    mouse two house
##   text1     0   0     0
##   text2     2   2     2
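If you only care about a short, known list of words (as in the question's "rates" and "countries"), a lighter alternative that skips tagging altogether is to map those specific forms by hand. This is a different technique from the spaCy approach above: a rough sketch using quanteda's `tokens_replace()`, where the two example sentences and the word list are made up for illustration.

```r
library(quanteda)

# Made-up example documents
txt2 <- c(
  d1 = "Interest rates rose in many countries.",
  d2 = "The rate in one country fell."
)

toks <- tokens(txt2, remove_punct = TRUE)

# Replace only the listed plural forms; every other token is untouched
toks <- tokens_replace(
  toks,
  pattern = c("rates", "countries"),
  replacement = c("rate", "country"),
  valuetype = "fixed"
)

dfmat2 <- dfm(toks)
featnames(dfmat2)
```

The trade-off is that you have to maintain the list yourself, so this only makes sense when the set of words to collapse is small and known in advance.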