Text mining with the tm package - stemming

Time: 2013-04-17 20:15:17

Tags: r text-mining tm

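I am doing some text mining in R using the tm package. Everything works smoothly, but one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming). Some words share the same stem, yet it is important that they are not "thrown together", because they mean different things.

For an example, see the 4 texts below. Here "lecturer" and "lecture" (or "association" and "associate") cannot be used interchangeably. Yet that is exactly what happens in step 4.

Is there an elegant way to handle such cases/words manually (e.g. so that "lecturer" and "lecture" are kept as two different things)?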

library(tm)

texts <- c("i am member of the XYZ association",
"apply for our open associate position", 
"xyz memorial lecture takes place on wednesday", 
"vote for the most popular lecturer")

# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))

# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus

# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")  

inspect(corpus.temp)

# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)  

inspect(corpus.final)

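2 Answers:

Answer 0: (score: 10)

I'm not 100% sure what you're after, and I don't fully understand how tm_map works. If I understand correctly, the following works. As I read it, you want to supply a list of words that should not be stemmed. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.

Note that I got frustrated trying to use mgsub with tm_map because it kept throwing an error, so I just used lapply instead.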

texts <- c("i am member of the XYZ association",
    "apply for our open associate position", 
    "xyz memorial lecture takes place on wednesday", 
    "vote for the most popular lecturer")

library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))

library(qdap)
# Step 2: list of words to retain and their identifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")

# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)

# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")  

# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)

inspect(corpus)       #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)

# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)  
inspect(corpus.final)

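Basically it works by:

  1. subbing in a unique identifier key for each supplied "NO STEM" word (the mgsub)
  2. then stemming (using stemDocument)
  3. next reversing the substitution, swapping the identifier keys back for the "NO STEM" words (the mgsub again)
  4. last, completing the stems (stemCompletion)

Here's the output: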

    ## >     inspect(corpus.final)
    ## A corpus with 4 text documents
    ## 
    ## The metadata consists of 2 tag-value pairs and a data frame
    ## Available tags are:
    ##   create_date creator 
    ## Available variables in the data frame are:
    ##   MetaID 
    ## 
    ## $`1`
    ## i am member of the XYZ associate
    ## 
    ## $`2`
    ##  for our open associate position
    ## 
    ## $`3`
    ## xyz memorial lecture takes place on wednesday
    ## 
    ## $`4`
    ## vote for the most popular lecturer
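
For readers who would rather not depend on qdap, the masking and unmasking steps above can be sketched in base R. This is only an illustration of the substitution idea, not the answer's own code: the helper name `swap` is hypothetical, and the stemming call that would sit between the two substitutions is left out so the snippet runs without any packages.

```r
# Words to protect from stemming, and the placeholder keys that stand in for them
retain <- c("lecturer", "lecture")
keys   <- paste0(seq_along(retain), "_SPECIAL_WORD")

# Hypothetical helper: replace each element of `from` with the matching element
# of `to`, longest pattern first, so "lecturer" is handled before "lecture"
swap <- function(x, from, to) {
  for (i in order(nchar(from), decreasing = TRUE)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

texts <- c("xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

masked   <- swap(texts, retain, keys)   # protected words become placeholder keys
# ... stemming of `masked` would happen here ...
restored <- swap(masked, keys, retain)  # placeholder keys become the words again
```

Replacing the longest pattern first matters here: if "lecture" were substituted before "lecturer", the second text would end up with a key followed by a stray "r".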
    

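Answer 1: (score: 0)

You can also use the following package to stem words: https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf

You just need to call the function wordStem, passing it the vector of words to be stemmed and the language you are working with. To find the exact language string to use, you can call getStemLanguages, which returns all available options.

Kind regards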
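
As a rough sketch of how this could be applied to the question, the snippet below stems every word except those on a keep list. It assumes the SnowballC package is installed; the helper name `stem_except` and the simple split on single spaces are illustrative choices, not part of SnowballC.

```r
library(SnowballC)

# Language strings accepted by wordStem()
getStemLanguages()

# Hypothetical helper: stem each space-separated word in `text`,
# leaving any word listed in `keep` untouched
stem_except <- function(text, keep, language = "english") {
  words   <- strsplit(text, " ", fixed = TRUE)[[1]]
  protect <- words %in% keep
  words[!protect] <- wordStem(words[!protect], language = language)
  paste(words, collapse = " ")
}

texts <- c("xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")
keep  <- c("lecturer", "lecture")

out <- vapply(texts, stem_except, character(1), keep = keep, USE.NAMES = FALSE)
out
```

Because the protected words never reach the stemmer, no placeholder keys and no stem completion are needed for them; the other words are still reduced to their stems.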