Goal
I want to count how many times the word "love" occurs in a document, but only when it is not preceded by the word "not". For example, "I love films" counts as one occurrence, whereas "I do not love films" does not.
Question
How can I do this using the tm package?
R code
Below is some self-contained code which I would like to modify to do the above.
require(tm)
# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.",
"I do not love the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
"I hate the `Red Hot Chilli Peppers`!")
# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)
# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))
# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)
# construct dictionary
my.dictionary.terms <- tolower(c("love", "Hate"))
my.dictionary <- Dictionary(my.dictionary.terms)
# construct the term document matrix
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary))
inspect(my.tdm)
# Terms positiveText neutralText negativeText
# hate 0 1 1
# love 2 1 0
More information
I am trying to reproduce the dictionary-rules functionality of the commercial software package WordStat. It is able to make use of dictionary rules, i.e.
"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts"
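For the simplest such rule ("not immediately before"), a negative lookbehind in base R regexes can approximate this kind of proximity matching without tm at all. A minimal sketch, not WordStat's actual implementation:

```r
# Count "love" per document, skipping occurrences immediately
# preceded by "not " (PCRE fixed-width negative lookbehind)
txt <- c("I love the movie", "I do not love the movie")
hits <- gregexpr("(?<!not )\\blove\\b", txt, perl = TRUE)
lengths(regmatches(txt, hits))
# 1 0
```

The `\\b` word boundaries also stop "lovely" from matching, which is worth keeping in mind given the stemming step below.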
I have also noticed this interesting question: Open-source rule-based pattern matching / information extraction frameworks?
Update 1: Based on @Ben's comment and post I got this (although it is slightly different at the end, it is strongly inspired by his answer, so full credit to him)
require(data.table)
require(RWeka)
# bi-gram tokeniser function
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
# get all 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames=TRUE)
setkey(dt, rn)
# attempt at extracting but includes overlaps i.e. words counted twice
dt[like(rn, "love")]
# rn positiveText neutralText negativeText
# 1: i love 1 0 0
# 2: love 2 1 0
# 3: love peopl 1 0 0
# 4: love the 1 1 0
# 5: most love 1 0 0
# 6: not love 0 1 0
Then I think I need to do some row subsetting and row subtraction, which would lead to something like
dt1 <- dt["love"]
# rn positiveText neutralText negativeText
#1: love 2 1 0
dt2 <- dt[like(rn, "love") & like(rn, "not")]
# rn positiveText neutralText negativeText
#1: not love 0 1 0
# somehow do something like
# DT = dt1 - dt2
# but I can't work out how to code that, although the required output would be
# rn positiveText neutralText negativeText
#1: love 2 0 0
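One way to get that subtraction is to subtract the count columns element-wise. A hedged sketch, using toy one-row tables standing in for the dt1 and dt2 above:

```r
library(data.table)

# toy stand-ins for dt1 (all "love" counts) and dt2 (the "not love" counts)
dt1 <- data.table(rn = "love",     positiveText = 2, neutralText = 1, negativeText = 0)
dt2 <- data.table(rn = "not love", positiveText = 0, neutralText = 1, negativeText = 0)

cols <- c("positiveText", "neutralText", "negativeText")
# element-wise subtraction of the count columns (both tables have one row)
DT <- data.table(rn = "love", dt1[, ..cols] - dt2[, ..cols])
DT
#      rn positiveText neutralText negativeText
# 1: love            2           0            0
```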
I can't work out how to get that last row using data.table, but this approach would be similar to WordStat's 'NOT NEAR' dictionary functionality, e.g. in this case only count the word "love" if it does not appear within one word directly before or after the word "not".
If we were to do an m-gram tokeniser, it would be like saying we only count the word "love" if it does not appear within (m-1) words of either side of the word "not".
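That generalisation can also be sketched directly over a token vector in base R, without building an m-gram matrix at all. `count_not_near` is a hypothetical helper (not part of tm or WordStat), counting the target word only when the blocker word is outside the given window:

```r
# Hypothetical helper: count `target` tokens with no `blocker` within
# `window` words on either side (a WordStat-style NOT NEAR rule)
count_not_near <- function(text, target = "love", blocker = "not", window = 1) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  hits <- which(words == target)
  keep <- vapply(hits, function(i) {
    nbhd <- words[max(1, i - window):min(length(words), i + window)]
    !blocker %in% nbhd
  }, logical(1))
  sum(keep)
}

count_not_near("I love it but I do not love that")
# 1
```

Increasing `window` to m-1 gives the m-gram behaviour described above.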
Other approaches are most welcome!
Answer 0 (score: 1)
This is an interesting question about collocation extraction, which, despite its popularity in corpus linguistics, does not seem to be built into any R package (except this one, which is not on CRAN or github). I think this code will answer your question, but there might be a more general solution than this.
Here's your example (thanks for the easy-to-use example)
##############
require(tm)
# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.",
"I do not `love` the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
"I hate the `Red Hot Chilli Peppers`!")
# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)
# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))
# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
# 'not' is a stopword so let's not remove that
# my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)
# construct dictionary - not used in this case
# my.dictionary.terms <- tolower(c("love", "Hate"))
# my.dictionary <- Dictionary(my.dictionary.terms)
Here's my suggestion: make a term-document matrix of bigrams and subset it
#Tokenizer for n-grams and passed on to the term-document matrix constructor
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
inspect(txtTdmBi)
# find bigrams that have 'love' in them
love_bigrams <- txtTdmBi$dimnames$Terms[grep("love", txtTdmBi$dimnames$Terms)]
# keep only bigrams where 'love' is not the first word
# to avoid counting 'love' twice and so we can subset
# based on the preceding word
require(Hmisc)
love_bigrams <- love_bigrams[sapply(love_bigrams, function(i) first.word(i)) != 'love']
# exclude the specific bigram 'not love'
love_bigrams <- love_bigrams[!love_bigrams == 'not love']
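If you would rather not pull in Hmisc just for `first.word`, the same filtering can be sketched with a base-R `sub()`, using the example bigrams from above as a stand-in:

```r
# base-R equivalent of the Hmisc::first.word filtering step:
# drop bigrams starting with "love", then drop "not love" itself
love_bigrams <- c("i love", "love peopl", "most love", "not love")
first_words  <- sub("\\s.*$", "", love_bigrams)  # text before the first space
love_bigrams <- love_bigrams[first_words != "love" & love_bigrams != "not love"]
love_bigrams
# "i love"  "most love"
```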
And here's the result: we get a count of 2 for "love", which has excluded the "not love" bigram.
# inspect the results
inspect(txtTdmBi[love_bigrams])
A term-document matrix (2 terms, 3 documents)
Non-/sparse entries: 2/4
Sparsity : 67%
Maximal term length: 9
Weighting : term frequency (tf)
Docs
Terms positiveText neutralText negativeText
i love 1 0 0
most love 1 0 0
# get counts of 'love' (excluding 'not love')
colSums(as.matrix(txtTdmBi[love_bigrams]))
positiveText neutralText negativeText
2 0 0
Answer 1 (score: 0)
This sounds like polarity. Although I am not going to answer the question you asked, I may be answering the larger question of sentence polarity. I have implemented the polarity
function in qdap version 1.2.0 which can do this, but saving all of the intermediate pieces you ask for would slow the function down too much.
library(qdap)
out <- apply_as_df(my.corpus, polarity, polarity.frame = POLENV)
lview(my.corpus)
df <- sentSplit(matrix2df(my.docs.df), "docs")
pols <- list(positives ="love", negatives="hate")
pols2 <- lapply(pols, function(x) term_match(df$docs, x, FALSE))
POLENV <- polarity_frame(positives =pols2[[1]], negatives=pols2[[2]])
output <- with(df, polarity(docs, var1, polarity.frame = POLENV))
counts(output)[, 1:5]
## > counts(output)[, 1:5]
## var1 wc polarity pos.words neg.words
## 1 positiveText 7 0.3779645 love -
## 2 positiveText 9 0.3333333 lovely -
## 3 neutralText 16 0.0000000 love hate
## 4 neutralText 5 0.0000000 - -
## 5 negativeText 7 -0.3779645 - hate
data.frame(scores(output))[, 1:4]
## var1 total.sentences total.words ave.polarity
## 1 negativeText 1 7 -0.3779645
## 2 neutralText 2 21 0.0000000
## 3 positiveText 2 16 0.3556489