How to implement proximity rules in a tm dictionary for counting words?

Date: 2013-07-31 19:22:17

Tags: r nlp weka data.table tm

Goal

I want to count the number of times the word "love" appears in a document, but only if it is not preceded by the word "not", e.g. "I love the movies" would count as one appearance whilst "I do not love the movies" would not count as an appearance.
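To make the rule concrete, here is a minimal base-R sketch of the counting behaviour I am after (count_love is just a hypothetical helper for illustration, not part of tm):

# minimal sketch: count "love" tokens whose immediately preceding token is not "not"
count_love <- function(txt) {
  toks <- strsplit(tolower(txt), "\\s+")[[1]]
  hits <- which(toks == "love")
  sum(vapply(hits, function(i) i == 1 || toks[i - 1] != "not", logical(1)))
}
count_love("I love the movies")        # 1
count_love("I do not love the movies") # 0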

Problem

How can I do this using the tm package?

R code

Below is some self-contained code that I would like to modify to do the above.

require(tm)

# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.", 
          "I do not love the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
          "I hate the `Red Hot Chilli Peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary
my.dictionary.terms <- tolower(c("love", "Hate"))
my.dictionary <- Dictionary(my.dictionary.terms)

# construct the term document matrix
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary))
inspect(my.tdm)

# Terms  positiveText neutralText negativeText
# hate            0           1            1
# love            2           1            0

More information

I am trying to reproduce the dictionary rules functionality of the commercial software package WordStat. It is able to make use of dictionary rules, i.e.

"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts"

I also noticed this interesting question: Open-source rule-based pattern matching / information extraction frameworks?


Update 1: Based on @Ben's comment and post I got this (although slightly different in the end, it is strongly inspired by his answer, so full credit to him)

require(data.table)
require(RWeka)

# uni-gram and bi-gram tokeniser function (min = 1 keeps single words too)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# get all 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))

# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames=TRUE)
setkey(dt, rn)

# attempt at extracting but includes overlaps i.e. words counted twice 
dt[like(rn, "love")]
#            rn positiveText neutralText negativeText
# 1:     i love            1           0            0
# 2:       love            2           1            0
# 3: love peopl            1           0            0
# 4:   love the            1           1            0
# 5:  most love            1           0            0
# 6:   not love            0           1            0

Then I reckon I need to do some row subsetting and row subtraction, which would lead to something like

dt1 <- dt["love"]
#     rn positiveText neutralText negativeText
#1: love            2           1            0

dt2 <- dt[like(rn, "love") & like(rn, "not")]
#         rn positiveText neutralText negativeText
#1: not love            0           1            0

# somehow do something like 
# DT = dt1 - dt2 
# but I can't work out how to code that; the required output would be
#     rn positiveText neutralText negativeText
#1: love            2           0            0

I can't work out how to do that last line using data.table, but this approach would be similar to WordStat's 'NOT NEAR' dictionary functionality, i.e. in this case only count the word "love" if it does not appear within one word directly before or directly after the word "not".
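One possible sketch of that subtraction (assuming dt2 contains exactly one matching row; with several rows they would need to be summed with colSums() first) might be:

# subtract the 'not love' counts from the plain 'love' counts, per document
cols <- c("positiveText", "neutralText", "negativeText")
DT <- data.table(rn = "love",
                 as.data.table(as.matrix(dt1[, cols, with = FALSE]) -
                               as.matrix(dt2[, cols, with = FALSE])))
DT
#      rn positiveText neutralText negativeText
# 1: love            2           0            0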

If we were to use an m-gram tokeniser, then it would be like saying we only count the word "love" if it does not appear within (m-1) words on either side of the word "not".
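For illustration, that generalised rule could also be sketched directly over word tokens instead of n-grams (not_near_count is a hypothetical helper, not a tm or WordStat API; window corresponds to m - 1):

# sketch of a generalised 'NOT NEAR' rule over plain word tokens:
# count `target` only when `blocker` is more than `window` words away
not_near_count <- function(txt, target = "love", blocker = "not", window = 1) {
  toks <- strsplit(tolower(txt), "\\s+")[[1]]
  hits <- which(toks == target)
  sum(vapply(hits, function(i) {
    nearby <- toks[max(1, i - window):min(length(toks), i + window)]
    !blocker %in% nearby
  }, logical(1)))
}
not_near_count("i do not really love them", window = 2) # 0 ('not' within 2 words)
not_near_count("i do not really love them", window = 1) # 1 ('not' is 2 words away)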

Other approaches are most welcome!

2 answers:

Answer 0 (score: 1)

This is an interesting question about collocation extraction, which, despite its popularity in corpus linguistics, doesn't seem to be built into any R package (except this one, which is not on CRAN or github). I think this code will answer your question, but there might be a more general solution.

Here's your example (thanks for the easy-to-use example)

##############
require(tm)

# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.", 
             "I do not `love` the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
             "I hate the `Red Hot Chilli Peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
# 'not' is a stopword so let's not remove that
# my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary - not used in this case
# my.dictionary.terms <- tolower(c("love", "Hate"))
# my.dictionary <- Dictionary(my.dictionary.terms)

Here's my suggestion: make a term-document matrix of bigrams and then subset it

#Tokenizer for n-grams and passed on to the term-document matrix constructor
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
inspect(txtTdmBi)

# find bigrams that have 'love' in them
love_bigrams <- txtTdmBi$dimnames$Terms[grep("love", txtTdmBi$dimnames$Terms)]

# keep only bigrams where 'love' is not the first word
# to avoid counting 'love' twice and so we can subset 
# based on the preceding word
require(Hmisc)
love_bigrams <- love_bigrams[sapply(love_bigrams, function(i) first.word(i)) != 'love']
# exclude the specific bigram 'not love'
love_bigrams <- love_bigrams[!love_bigrams == 'not love']
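As an aside, the Hmisc dependency could presumably be avoided with a base-R equivalent of the first-word filter, e.g.:

# base-R alternative to Hmisc::first.word for the same filtering step
love_bigrams[sapply(strsplit(love_bigrams, " "), `[`, 1) != "love"]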

Here's the result: we get a count of 2 for "love", which has excluded the 'not love' bigram.

# inspect the results
inspect(txtTdmBi[love_bigrams])

A term-document matrix (2 terms, 3 documents)

Non-/sparse entries: 2/4
Sparsity           : 67%
Maximal term length: 9 
Weighting          : term frequency (tf)

           Docs
Terms       positiveText neutralText negativeText
  i love               1           0            0
  most love            1           0            0

# get counts of 'love' (excluding 'not love')
colSums(as.matrix(txtTdmBi[love_bigrams]))
positiveText  neutralText negativeText 
           2            0            0 

Answer 1 (score: 0)

This sounds like polarity to me. While I'm not going to answer the exact question you asked, perhaps I can address the larger question of sentence polarity. I have implemented the polarity function in qdap version 1.2.0 that can do this, but saving all of the intermediate pieces you ask for would slow the function down too much.

library(qdap)

# split the documents into sentences
df <- sentSplit(matrix2df(my.docs.df), "docs")

# build a polarity frame from the terms of interest
pols <- list(positives = "love", negatives = "hate")
pols2 <- lapply(pols, function(x) term_match(df$docs, x, FALSE))
POLENV <- polarity_frame(positives = pols2[[1]], negatives = pols2[[2]])

# polarity can also be applied directly to the tm corpus
out <- apply_as_df(my.corpus, polarity, polarity.frame = POLENV)
lview(my.corpus)

output <- with(df, polarity(docs, var1, polarity.frame = POLENV))
counts(output)[, 1:5]
counts(output)[, 1:5]

## > counts(output)[, 1:5]
##           var1 wc   polarity pos.words neg.words
## 1 positiveText  7  0.3779645      love         -
## 2 positiveText  9  0.3333333    lovely         -
## 3  neutralText 16  0.0000000      love      hate
## 4  neutralText  5  0.0000000         -         -
## 5 negativeText  7 -0.3779645         -      hate

data.frame(scores(output))[, 1:4]

##           var1 total.sentences total.words ave.polarity
## 1 negativeText               1           7   -0.3779645
## 2  neutralText               2          21    0.0000000
## 3 positiveText               2          16    0.3556489