Efficient way to remove all proper names from a corpus

Time: 2017-01-01 15:53:33

Tags: r text

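Working in R, I'm trying to find an efficient way to search a text file and remove or replace all instances of proper names (e.g., Thomas). I assume something exists that can do this, but I haven't been able to find it.

So, in this example, "Susan" and "Bob" would be removed. This is a simplified example; in reality I'd like this to work across hundreds of documents, and therefore with a fairly large list of names.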

texts <- as.data.frame (rbind (
    'This text stuff if quite interesting',
    'Where are all the names said Susan',
   'Bob wondered what happened to all the proper nouns'
    ))
names(texts) [1] <- "text"

1 answer:

Answer 0: (score: 1)

Here is one approach based on a dataset of first names:

install.packages("gender") 
library(gender)
install_genderdata_package()

sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)

texts <- as.data.frame (rbind (
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))

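# Remove every entry of `words` from `txt`. The alternation is split into
# several regexes of roughly `n` characters so no single pattern grows too large.
removeWords <- function(txt, words, n = 30000L) {
  # Running length of the words as if they were joined by "|" separators
  l <- cumsum(nchar(words)+c(0, rep(1, length(words)-1)))
  # Assign each word to a chunk of about n characters
  groups <- cut(l, breaks = seq(1,ceiling(tail(l, 1)/n)*n+1, by = n))
  # Build one whole-word, Unicode-aware (UCP) alternation per chunk
  regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  # Apply each chunk's regex in turn, ignoring case
  for (regex in regexes)  txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}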
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"           
# [2] "Where are all the names said "                  
# [3] " wondered what happened to all the proper nouns"

Some tuning may be needed for your particular dataset.
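As an illustration of the kind of tuning that might be needed, the sketch below compares case-insensitive and case-sensitive matching. It is not part of the original answer: `name_list`, the sample sentences, and the helper `remove_names` are hypothetical, and the name list is assumed to be capitalised (a lowercase list would have to be converted first). Case-insensitive matching also strips ordinary words such as "will" or "rose" when they happen to be names; case-sensitive matching keeps them, at the cost of missing names written in lowercase.

```r
# Hypothetical name list standing in for the `stopwords` vector above;
# assumed to be capitalised.
name_list <- c("Susan", "Bob", "Will", "Mark", "Rose")

sample_texts <- c("Where are all the names said Susan",
                  "Bob will mark the rose garden",
                  "Will wondered what happened")

# Hypothetical helper: remove whole-word matches, then tidy the whitespace
remove_names <- function(txt, words, ignore_case = FALSE) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  out <- gsub(pattern, "", txt, perl = TRUE, ignore.case = ignore_case)
  # Collapse the runs of spaces left behind and trim both ends
  trimws(gsub("[[:space:]]+", " ", out))
}

# Case-insensitive: "will", "mark" and "rose" are removed as well
loose <- remove_names(sample_texts, name_list, ignore_case = TRUE)

# Case-sensitive: only capitalised occurrences are treated as names
strict <- remove_names(sample_texts, name_list)

print(loose)
print(strict)
```

Which variant is preferable depends on the corpus: sentence-initial words are capitalised too, so case-sensitive matching still removes "Will" at the start of a sentence even when it is used as a verb.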

Another approach could be based on part-of-speech tagging.
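A rough sketch of what the tagging route could look like, under stated assumptions: the `anno` table below is hand-written and only stands in for the output of a real tagger (one row per token, with a universal POS tag where "PROPN" marks proper nouns). In practice the table would come from a tagging package, and the tags would be only as reliable as the model behind it. The helper `drop_proper_nouns` is hypothetical.

```r
# Hand-written stand-in for tagger output (an assumption, not real tagger
# results): one row per token, with a universal POS tag in `upos`.
anno <- data.frame(
  doc_id = c(rep("doc1", 7), rep("doc2", 9)),
  token  = c("Where", "are", "all", "the", "names", "said", "Susan",
             "Bob", "wondered", "what", "happened", "to", "all", "the",
             "proper", "nouns"),
  upos   = c("ADV", "AUX", "DET", "DET", "NOUN", "VERB", "PROPN",
             "PROPN", "VERB", "PRON", "VERB", "ADP", "DET", "DET",
             "ADJ", "NOUN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop tokens tagged as proper nouns and re-join the rest
drop_proper_nouns <- function(anno) {
  kept <- anno[anno$upos != "PROPN", ]
  # Keep every document as a factor level so that a document consisting
  # only of proper nouns comes back as an empty string instead of vanishing
  docs <- factor(kept$doc_id, levels = unique(anno$doc_id))
  # Tokens are re-joined with single spaces; original spacing is not preserved
  vapply(split(kept$token, docs), paste, character(1), collapse = " ")
}

cleaned <- drop_proper_nouns(anno)
print(cleaned)
```

Unlike the name-list approach, this does not depend on a dictionary of first names, so it would also catch surnames and place names, provided the tagger labels them correctly.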