How to chain together multiple qdap transformations for text mining / sentiment (polarity) analysis in R

Time: 2015-12-01 14:49:24

Tags: r text-mining sentiment-analysis tm qdap

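I have a data.frame with week numbers, week, and text reviews, text. I would like to treat the week variable as my grouping variable and run some basic text analysis on it (e.g. qdap::polarity). Some of the review texts contain multiple sentences; however, I only care about the week's "extreme" polarity.

How can I chain together multiple text transformations before running qdap::polarity while adhering to its warning message? I am able to chain transformations together with tm::tm_map and tm::tm_reduce - is there something comparable in qdap? What is the correct way to pre-treat / transform this text before running qdap::polarity and/or qdap::sentSplit?

More details in the following code / reproducible example: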

library(qdap)
library(tm)

df <- data.frame(week = c(1, 1, 1, 2, 2, 3, 4),
                 text = c("This is some text. It was bad. Not good.",
                          "Another review that was bad!",
                          "Great job, very helpful; more stuff here, but can't quite get it.",
                          "Short, poor, not good Dr. Jay, but just so-so. And some more text here.",
                          "Awesome job! This was a great review. Very helpful and thorough.",
                          "Not so great.",
                          "The 1st time Mr. Smith helped me was not good."),
                 stringsAsFactors = FALSE)

docs <- as.Corpus(df$text, df$week)

funs <- list(stripWhitespace,
             tolower,
             replace_ordinal,
             replace_number,
             replace_abbreviation)

# Is there a qdap function that does something similar to the next line?
# Or is there a way to pass this VCorpus / Corpus directly to qdap::polarity?
docs <- tm_map(docs, FUN = tm_reduce, tmFuns = funs)


# At the end of the day, I would like to get this type of output, but adhere to
# the warning message about running sentSplit. How should I pre-treat / cleanse
# these sentences, but keep the "week" grouping?
pol <- polarity(df$text, df$week)

## Not run:
# check_text(df$text)

1 answer:

Answer 0: (score: 1)

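You could run sentSplit as suggested in the warning, as follows:

df_split <- sentSplit(df, "text")
with(df_split, polarity(text, week))

##   week total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1    1               5          26       -0.138       0.710             -0.195
## 2    2               6          26        0.342       0.402              0.852
## 3    3               1           3       -0.577          NA                 NA
## 4    4               2          10        0.000       0.000                NaN

Note that I have a breakout sentiment package, sentimentr, available on GitHub that improves on the qdap version in speed, functionality, and documentation. It does the sentence splitting internally in the sentiment_by function. The script below allows you to install the package and use it: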
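On the chaining part of the question: one possible approach, independent of either package, is to fold a list of cleaning functions over the text column with base R's Reduce(). The sketch below is only an illustration under stated assumptions. The helper names squish and chain are hypothetical, the data is a shortened stand-in for the example above, and the cleaning functions are base R only. qdap functions such as replace_abbreviation could presumably be appended to the list, assuming each one takes a character vector and returns a character vector.

```r
# Minimal sketch: chain text-cleaning functions with base R's Reduce().
# Assumption: every function in `funs` maps a character vector to a
# character vector of the same length.

df <- data.frame(week = c(1, 1, 2),
                 text = c("This  is some text.   It was bad.",
                          "Another REVIEW that was bad!",
                          "Not so   great."),
                 stringsAsFactors = FALSE)

# Hypothetical helper: trim the ends and collapse runs of whitespace.
squish <- function(x) gsub("\\s+", " ", trimws(x))

# Functions are applied in list order.
funs <- list(squish, tolower)

# Hypothetical helper: apply each function in turn, starting from `x`.
chain <- function(x, funs) Reduce(function(acc, f) f(acc), funs, x)

# Only the text column is rewritten, so the week grouping is untouched.
df$text <- chain(df$text, funs)
print(df)
```

Because the week column is left alone, the cleaned data.frame can then be passed to sentSplit(df, "text") and polarity as shown in the answer.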