Is there a better way to ignore plurals than "stem = TRUE" in dfm?

Asked: 2019-04-20 10:49:50

Tags: r nlp quanteda

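I am preprocessing my data to run an LDA model. Is there a better way to ignore plurals (for example "rates", "rate", "country", "countries") than using "stem = TRUE"? I don't want to stem every word, only a few specific words that occur frequently in both their plural and singular forms.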

Any tips?

I tried "stem = TRUE". I also created a dictionary and passed "dictionary = dict" in the dfm call, but apparently that keeps only the words that are in the dictionary.

1 answer:

Answer 0: (score: 0)

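The best way to do this is to use a tool to tag your plural nouns and then convert them to singular. Unlike the stemmer solution, this will not stem words such as "stemming" to "stem" or "quickly" to "quick".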
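To see what the stemmer does to words that are not plurals, it can be run on a few words directly. This is only a rough sketch: the word list is made up for illustration, and it assumes quanteda's char_wordstem() with its default English Snowball stemmer.

```r
library("quanteda")

# Made-up word list: two regular plurals, two irregular plurals,
# and two non-plurals that the stemmer will still cut down
words <- c("rates", "countries", "mice", "stimuli", "stemming", "quickly")

# char_wordstem() stems each element of a character vector
stems <- char_wordstem(words)

# Show each word next to its stem
data.frame(word = words, stem = stems)
```

The point of the comparison is that stemming applies to every word, not just plural nouns.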

For the tagging step I recommend the spacyr package, which integrates well with quanteda. Here is an example:

library("quanteda")
## Package version: 1.4.3

library("spacyr")

txt <- c(
  "Plurals in English can include irregular words such as stimuli.",
  "One mouse, two mice, one house, two houses."
)
txt_parsed <- spacy_parse(txt, tag = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.1.3, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
txt_parsed
##    doc_id sentence_id token_id     token     lemma   pos tag     entity
## 1   text1           1        1   Plurals    plural  NOUN NNS           
## 2   text1           1        2        in        in   ADP  IN           
## 3   text1           1        3   English   English PROPN NNP LANGUAGE_B
## 4   text1           1        4       can       can  VERB  MD           
## 5   text1           1        5   include   include  VERB  VB           
## 6   text1           1        6 irregular irregular   ADJ  JJ           
## 7   text1           1        7     words      word  NOUN NNS           
## 8   text1           1        8      such      such   ADJ  JJ           
## 9   text1           1        9        as        as   ADP  IN           
## 10  text1           1       10   stimuli  stimulus  NOUN NNS           
## 11  text1           1       11         .         . PUNCT   .           
## 12  text2           1        1       One       one   NUM  CD CARDINAL_B
## 13  text2           1        2     mouse     mouse  NOUN  NN           
## 14  text2           1        3         ,         , PUNCT   ,           
## 15  text2           1        4       two       two   NUM  CD CARDINAL_B
## 16  text2           1        5      mice     mouse  NOUN NNS           
## 17  text2           1        6         ,         , PUNCT   ,           
## 18  text2           1        7       one       one   NUM  CD CARDINAL_B
## 19  text2           1        8     house     house  NOUN  NN           
## 20  text2           1        9         ,         , PUNCT   ,           
## 21  text2           1       10       two       two   NUM  CD CARDINAL_B
## 22  text2           1       11    houses     house  NOUN NNS           
## 23  text2           1       12         .         . PUNCT   .

# replace token with lemma for plural nouns
txt_parsed$token <- ifelse(txt_parsed$tag == "NNS",
  txt_parsed$lemma,
  txt_parsed$token
)

(Of course, there are many ways to do this conditional replacement, including dplyr.)
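For instance, a dplyr version of the same replacement might look like the sketch below. It assumes dplyr is installed. So that it runs without spaCy, it uses a small hand-made data frame called `parsed` (a hypothetical stand-in with the same token, lemma and tag columns as the spacy_parse() output).

```r
library("dplyr")

# Hypothetical stand-in for a few rows of spacy_parse() output
parsed <- data.frame(
  token = c("two", "mice", "one", "house"),
  lemma = c("two", "mouse", "one", "house"),
  tag   = c("CD", "NNS", "CD", "NN"),
  stringsAsFactors = FALSE
)

# Replace the token with its lemma only where the tag marks a plural noun
parsed <- parsed %>%
  mutate(token = if_else(tag == "NNS", lemma, token))

parsed$token
```

The same mutate() call should work on txt_parsed in place of the ifelse() assignment. To lemmatize only certain words, the condition could be narrowed, for example to `tag == "NNS" & lemma %in% c("rate", "country")`.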

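Now the words that were plural nouns have been replaced by their singular forms, including irregular ones such as "stimuli" and "mice" that no stemmer would be smart enough to figure out.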

dfmat <- dfm(as.tokens(txt_parsed), remove_punct = TRUE)
dfmat
## Document-feature matrix of: 2 documents, 14 features (50.0% sparse).
## 2 x 14 sparse Matrix of class "dfm"
##        features
## docs    plural in english can include irregular word such as stimulus one
##   text1      1  1       1   1       1         1    1    1  1        1   0
##   text2      0  0       0   0       0         0    0    0  0        0   2
##        features
## docs    mouse two house
##   text1     0   0     0
##   text2     2   2     2
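If only a handful of known words need to be collapsed, as in the question, a simpler alternative that skips tagging is a fixed replacement list. This is a different technique from the spacyr approach above, not part of it. The sketch assumes quanteda's tokens_replace() called with pattern and replacement vectors of equal length; the sentence and the word list are made up for illustration.

```r
library("quanteda")

# Made-up example sentence containing both singular and plural forms
toks <- tokens(
  "Rates rose in some countries, but the rate fell in one country.",
  remove_punct = TRUE
)

# Map each listed plural to its singular; all other tokens are left alone
toks <- tokens_replace(
  toks,
  pattern     = c("rates", "countries"),
  replacement = c("rate", "country")
)

# dfm() lowercases by default, so "rate" and "country" each become one feature
dfmat_fixed <- dfm(toks)
featnames(dfmat_fixed)
```

The drawback is that the list has to be maintained by hand. The tagging approach finds plural nouns automatically, including irregular ones.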