Question

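I'm learning text mining in R and have had fairly good success so far, but I'm still stuck on how to handle plurals. That is, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word too.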

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'

Answer 1

Here is one possible solution. I use the pacman package to make the solution self-contained:

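# Load pacman (installing it first if needed), then the two packages used below
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')  # GitHub package that provides singularize()
p_load(quanteda)

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'
# Tokenize, flatten to a character vector, then singularize each token.
# Note: tokenize() comes from older quanteda releases; newer ones use tokens() instead.
singularize(unlist(tokenize(x)))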

##  [1] "\""         "nation"     "\""         "and"        "\""         "nation"     "\""        
##  [8] "to"         "be"         "counted"    "a"          "the"        "same"       "word"      
## [15] "and"        "ideally"    "\""         "dictionary" "\""         "and"        "\""        
## [22] "dictionary" "\""
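
To actually count the merged forms, the singularized tokens can be tabulated. The following is a minimal base-R sketch, not the method used in this answer: `naive_singular()` is a hypothetical stand-in that handles only two suffix rules, and it is included only so the example runs without installing anything. It also guards against very short words, since the output above shows "as" being reduced to "a".

```r
# Hypothetical stand-in for a real singularizer; covers only two suffix rules.
naive_singular <- function(w) {
  w <- sub("ies$", "y", w)                      # dictionaries -> dictionary
  ends_s <- grepl("[^s]s$", w) & nchar(w) > 3   # skip short words ("as") and "-ss" endings
  w[ends_s] <- sub("s$", "", w[ends_s])         # nations -> nation
  w
}

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'

# Lowercase, split on anything that is not a letter, and drop empty strings
tokens <- unlist(strsplit(tolower(x), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Frequency table over the singularized tokens
counts <- table(naive_singular(tokens))
counts[c("nation", "dictionary")]
```

With this input, "nation" and "dictionary" each get a count of 2. In practice you would swap `naive_singular()` for a proper singularizer such as the one in the pluralize package.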

Answer 2

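The SemNetCleaner package has a singularize function. It is slower than the pluralize package, but I have found that it handles nouns better. For example, "Mars" is not converted to "Mar".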

Text mining in R - handling plurals
