Stem every word

Date: 2018-10-07 13:00:40

Tags: r tm quanteda

I want to stem every word. For example, "hardworking employees" should be converted to "hardwork employee", not "hardworking employee". In other words, it should stem the two words separately. I know this doesn't make much sense, but it is only an example; in my actual data this kind of stemming does make sense.

I have a function that splits the text on the delimiter ',' and then performs stemming. I would like it modified so that every word inside each ','-delimited entry is stemmed.

dt = read.table(header = TRUE, 
text ="Word Synonyms
employee 'hardworking employees, intelligent employees, employment, employee'
lover  'loved ones, loving boy, lover'
", stringsAsFactors= F)

library(SnowballC)
library(parallel)

stem_text3 <- function(text, language = "english", mc.cores = 3) {
  stem_string <- function(str, language) {
    # split on commas only, so each comma-separated phrase reaches
    # wordStem() as a single element (which is why multi-word phrases
    # are not stemmed word by word)
    str <- strsplit(x = str, split = "\\,")
    str <- wordStem(unlist(str), language = language)
    str <- paste(str, collapse = ",")
    return(str)
  }

  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)

  # return stemmed text blocks
  return(unlist(x))
}

df000 <- data.frame(stringsAsFactors = F)
for (i in 1:nrow(dt)){
  sent = dt[i, "Synonyms"]
  k = data.frame(r_synonyms = stem_text3(sent, language = 'en'), stringsAsFactors = F)
  df000= rbind(df000,k)
}

1 Answer:

Answer 0 (score: 2)

This is tricky, because SnowballC::wordStem() stems each element of a character vector, so using it here requires splitting the character vector apart and then reassembling it.
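
A minimal illustration of that element-wise behaviour (a sketch, assuming SnowballC is installed; the first result matches the stems shown in the output further down):

library("SnowballC")

# each element of the vector is stemmed as its own token
wordStem(c("hardworking", "employees"), language = "english")
## [1] "hardwork" "employe"

# a multi-word string is treated as a single token, so it is not
# stemmed word by word
wordStem("hardworking employees", language = "english")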

I would drop the loop and vectorise it with an apply operation (which you could swap out for mclapply() if you want it parallelised).

library("stringi")
dt[["Synonyms"]] <- 
    sapply(stri_split_fixed(dt[["Synonyms"]], ","), function(x) {
        x <- lapply(stri_split_fixed(stri_trim_both(x), " "), function(y) {
            paste(SnowballC::wordStem(y), collapse = " ")
        })
        paste(x, collapse = ", ")
    })

dt
##       Word                                            Synonyms
## 1 employee hardwork employe, intellig employe, employ, employe
## 2    lover                            love on, love boi, lover
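
If you do want the parallel version, here is a sketch of the same logic using mclapply() (the mc.cores value is only an example, and forked parallelism does not apply on Windows):

library("parallel")
library("stringi")

dt[["Synonyms"]] <- unlist(mclapply(
    stri_split_fixed(dt[["Synonyms"]], ","),
    function(x) {
        # stem every word inside each comma-separated synonym,
        # then reassemble the phrase and the comma-separated list
        stemmed <- vapply(
            stri_split_fixed(stri_trim_both(x), " "),
            function(y) paste(SnowballC::wordStem(y), collapse = " "),
            character(1)
        )
        paste(stemmed, collapse = ", ")
    },
    mc.cores = 2
))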

Notes:

First, I suspect these are not the stems you were expecting, but that is how the Porter stemmer implemented in SnowballC works.

Second, there are better ways to approach this problem overall, but I can't really answer that unless you state your objective in the question. For instance, to replace a set of phrases in quanteda (using wildcards rather than stemming), you could do the following:

library("quanteda")
thedict <- dictionary(list(
    employee = c("hardwork* employ*", "intellig* employ*", "employment", "employee*"),
    lover = c("lov* ones", "lov* boy", "lover*")
))

tokens("Some employees are hardworking employees in useful employment.  
        They support loved osuch as their wives and lovers.") %>%
    tokens_lookup(dictionary = thedict, exclusive = FALSE, capkeys = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "Some"     "employee" "are"      "employee" "in"       "useful"   "employee"
##  [8] "."        "They"     "support"  "loved"    "osuch"    "as"       "their"   
## [15] "wives"    "and"      "lover"    "."
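
If the goal really is just "stem every token", quanteda can also do that directly with tokens_wordstem(), which wraps the same SnowballC stemmer. A sketch:

library("quanteda")

# stem each token individually; punctuation stays as its own token
tokens("hardworking employees, intelligent employees, employment, employee") %>%
    tokens_wordstem(language = "english")

This should produce the same "hardwork employe"-style stems as above, without having to split and reassemble the strings yourself.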