Removing words from a document corpus using a customized word list

Posted: 2016-03-04 12:30:02

Tags: list tm corpus

The tm package lets the user "trim" words and punctuation out of a document corpus:

    tm_map(corpusDocs, removeWords, stopwords("english"))

Is there a way to supply tm_map with a customized word list, read in from a CSV file, to be used in place of stopwords("english")?

Thanks.

BSL

2 answers:

Answer 0 (score: 1)

Let's take a file (wordMappings):

"from"|"to"
###Words######
"this"|"ThIs"
"is"|"Is"
"a"|"A"
"sample"|"SamPle"

First, removing the words:

library(tm)

readFile <- function(fileName, seperator) {
  read.csv(paste0("data\\", fileName, ".txt"),
           sep = seperator,
           quote = "\"",
           comment.char = "#",
           blank.lines.skip = TRUE,
           stringsAsFactors = FALSE,
           encoding = "UTF-8")
}

kelimeler <- c("this is a sample")
corpus <- Corpus(VectorSource(kelimeler))
seperatorOfTokens <- ' '
words <- readFile("wordMappings", "|")

toSpace <- content_transformer(function(x, from)
  gsub(sprintf("(^|%s)%s($|%s)", seperatorOfTokens, from, seperatorOfTokens),
       sprintf(" %s%s", ' ', seperatorOfTokens), x))
for (word in words$from) {
  corpus <- tm_map(corpus, toSpace, word)
}
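The word-boundary pattern that toSpace builds can be checked in isolation with base R; the values below are a stand-alone sketch, not part of the original answer:

```r
# Rebuild the pattern the way the transformer does: the word must be
# bounded by the token separator (or start/end of string) on both sides.
sep <- ' '
from <- "is"
pattern <- sprintf("(^|%s)%s($|%s)", sep, from, sep)

# "is" inside "this" is not matched; only the free-standing token is removed.
gsub(pattern, " ", "this is a sample")  # "this a sample"
```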

If you need a more flexible solution, for example replacing words instead of removing them, then:

#Specific Transformations
toMyToken <- content_transformer(function(x, from, to)
  gsub(sprintf("(^|%s)%s($|%s)", seperatorOfTokens, from, seperatorOfTokens),
       sprintf(" %s%s", to, seperatorOfTokens), x))

for (i in seq_len(nrow(words))) {
  print(sprintf("%s -> %s ", words$from[i], words$to[i]))
  corpus <- tm_map(corpus, toMyToken, words$from[i], words$to[i])
}

Now a sample run:

[1] "this -> ThIs "
[1] "is -> Is "
[1] "a -> A "
[1] "sample -> SamPle "
> content(corpus[[1]])
[1] " ThIs Is A SamPle "
> 
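It is worth adding that when removal alone is needed (as in the original question), tm's own removeWords accepts any character vector, so a list read from a CSV can be passed straight in. A minimal sketch; the inline vector stands in for the CSV contents, and the commented-out file name is hypothetical:

```r
library(tm)

corpusDocs <- Corpus(VectorSource("this is a sample"))

# In practice the vector would come from a file, e.g.:
# customWords <- read.csv("myWords.txt", header = FALSE,
#                         stringsAsFactors = FALSE)[, 1]
customWords <- c("this", "a")  # inline stand-in for the CSV contents

# removeWords takes any character vector, not just stopwords("english")
corpusDocs <- tm_map(corpusDocs, removeWords, customWords)
content(corpusDocs[[1]])
```

Note that removeWords leaves the surrounding whitespace behind, so a follow-up pass with stripWhitespace is usually wanted.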

Answer 1 (score: 0)

My solution, which is probably both cumbersome and inelegant:

#read in items to be removed
removalList = as.matrix( read.csv( listOfWordsAndPunc, header = FALSE ) )
#
#get the term listing from the document-term matrix
termListing = colnames( corpusFileDocs_dtm )
#
#find intersection of terms in removalList and termListing
commonWords = intersect( removalList, termListing )
removalIndxs = match( commonWords, termListing )
#
#create m for term frequency, etc.
m = as.matrix( corpusFileDocs_dtm )
#
#use removalIndxs to drop irrelevant columns from m
allColIndxs = 1 : length( termListing )
keepColIndxs = setdiff( allColIndxs, removalIndxs )
m = m[ ,keepColIndxs ]
#
#thence to tf-idf analysis with revised m
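The intersect/match/setdiff column-dropping step above can be exercised on a toy term list; the names below are illustrative, not from the post:

```r
termListing <- c("apple", "banana", "cherry", "date")
removalList <- c("banana", "date", "elderberry")  # one entry absent from the dtm

# Only terms that actually occur in the matrix yield removal indices
commonWords  <- intersect(removalList, termListing)
removalIndxs <- match(commonWords, termListing)

# Toy frequency matrix standing in for as.matrix(corpusFileDocs_dtm)
m <- matrix(1:8, nrow = 2, dimnames = list(NULL, termListing))
keepColIndxs <- setdiff(seq_along(termListing), removalIndxs)
m <- m[, keepColIndxs]
colnames(m)  # "apple" "cherry"
```

Because intersect filters out words that never appear in the corpus, match never produces NA indices, which keeps the setdiff step safe.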

Any suggestions for stylistic or computational improvement would be gratefully received.

BSL