Stem every word

Date: 2018-10-07 13:00:40

Tags: r tm quanteda

I want to stem every word. For example, "hardworking employees" should be converted to "hardwork employee", not "hardworking employee". In other words, it should stem the two words separately. I know this doesn't make much sense, but it is only an example; in my actual data this kind of stemming does make sense.

I have a function that splits the text on the delimiter ',' and then performs stemming. I would like it modified so that every word inside each ','-delimited entry is stemmed.

dt = read.table(header = TRUE, 
text ="Word Synonyms
employee 'hardworking employees, intelligent employees, employment, employee'
lover  'loved ones, loving boy, lover'
", stringsAsFactors= F)

library(SnowballC)
library(parallel)

stem_text3 <- function(text, language = "english", mc.cores = 3) {
  stem_string <- function(str, language) {
    # split on commas only, so each comma-separated phrase reaches
    # wordStem() as a single element (which is why multi-word phrases
    # are not stemmed word by word)
    str <- strsplit(x = str, split = "\\,")
    str <- wordStem(unlist(str), language = language)
    str <- paste(str, collapse = ",")
    return(str)
  }

  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)

  # return stemmed text blocks
  return(unlist(x))
}

df000 <- data.frame(stringsAsFactors = F)
for (i in 1:nrow(dt)){
  sent = dt[i, "Synonyms"]
  k = data.frame(r_synonyms = stem_text3(sent, language = 'en'), stringsAsFactors = F)
  df000= rbind(df000,k)
}

1 Answer:

Answer 0 (score: 2)

This is tricky, because SnowballC::wordStem() stems each element of a character vector, so using it here requires splitting the character vector apart and then reassembling it.
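
A minimal illustration of that element-wise behaviour (a sketch, assuming SnowballC is installed; the first result matches the stems shown in the output further down):

library("SnowballC")

# each element of the vector is stemmed as its own token
wordStem(c("hardworking", "employees"), language = "english")
## [1] "hardwork" "employe"

# a multi-word string is treated as a single token, so it is not
# stemmed word by word
wordStem("hardworking employees", language = "english")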

I would drop the loop and vectorise it with an apply operation (which you could swap out for mclapply() if you want it parallelised).

library("stringi")
dt[["Synonyms"]] <- 
    sapply(stri_split_fixed(dt[["Synonyms"]], ","), function(x) {
        x <- lapply(stri_split_fixed(stri_trim_both(x), " "), function(y) {
            paste(SnowballC::wordStem(y), collapse = " ")
        })
        paste(x, collapse = ", ")
    })

dt
##       Word                                            Synonyms
## 1 employee hardwork employe, intellig employe, employ, employe
## 2    lover                            love on, love boi, lover
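
If you do want the parallel version, here is a sketch of the same logic using mclapply() (the mc.cores value is only an example, and forked parallelism does not apply on Windows):

library("parallel")
library("stringi")

dt[["Synonyms"]] <- unlist(mclapply(
    stri_split_fixed(dt[["Synonyms"]], ","),
    function(x) {
        # stem every word inside each comma-separated synonym,
        # then reassemble the phrase and the comma-separated list
        stemmed <- vapply(
            stri_split_fixed(stri_trim_both(x), " "),
            function(y) paste(SnowballC::wordStem(y), collapse = " "),
            character(1)
        )
        paste(stemmed, collapse = ", ")
    },
    mc.cores = 2
))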

Notes:

First, I suspect these are not the stems you were expecting, but that is how the Porter stemmer implemented in SnowballC works.

Second, there are better ways to approach this problem overall, but I can't really answer that unless you state your objective in the question. For instance, to replace a set of phrases in quanteda (using wildcards rather than stemming), you could do the following:

library("quanteda")
thedict <- dictionary(list(
    employee = c("hardwork* employ*", "intellig* employ*", "employment", "employee*"),
    lover = c("lov* ones", "lov* boy", "lover*")
))

tokens("Some employees are hardworking employees in useful employment.  
        They support loved osuch as their wives and lovers.") %>%
    tokens_lookup(dictionary = thedict, exclusive = FALSE, capkeys = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "Some"     "employee" "are"      "employee" "in"       "useful"   "employee"
##  [8] "."        "They"     "support"  "loved"    "osuch"    "as"       "their"   
## [15] "wives"    "and"      "lover"    "."
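
If the goal really is just "stem every token", quanteda can also do that directly with tokens_wordstem(), which wraps the same SnowballC stemmer. A sketch:

library("quanteda")

# stem each token individually; punctuation stays as its own token
tokens("hardworking employees, intelligent employees, employment, employee") %>%
    tokens_wordstem(language = "english")

This should produce the same "hardwork employe"-style stems as above, without having to split and reassemble the strings yourself.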