Text mining using R

Asked: 2015-09-15 00:59:38

Tags: r text-mining

I need help with text mining in R. My data looks like this:

Title      Date            Content    
Boy        May 13 2015     "She is pretty", Tom said. Tom is handsome.
Animal     June 14 2015    The penguin is cute, lion added.
Human      March 09 2015   Mr Koh predicted that every human is smart...
Monster    Jan 22 2015     Ms May, a student, said that John has $10.80. May loves you.

I only want to extract the opinions from what people said.

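I would also like help keeping percentages intact (e.g. 9.8%), because when I split sentences on the full stop ("."), I get "His results improved by 0." instead of "His results improved by 0.8%".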

This is the output I would like to get:

Title      Date            Content    
Boy        May 13 2015     she is pretty
Animal     June 14 2015    the penguin is cute
Human      March 09 2015   every human is smart
Monster    Jan 22 2015     john has $10.80

Below is the code I tried, which does not produce the desired output:

list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
dataframe <- stack(setNames(lapply(strsplit(dataframe, '(?<=[.])', perl=TRUE), grep, pattern = pattern, value = TRUE), dataframe$Title))[2:1]

1 Answer:

Answer 0 (score: 2)

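You were close, but the regular expression used for splitting was wrong. The following gives the correct arrangement of the data, modulo your request to extract the opinions more precisely: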

txt <- '
Title      Date            Content    
Boy        May 13 2015     "She is pretty", Tom said. Tom is handsome.
Animal     June 14 2015    The penguin is cute, lion added.
Human      March 09 2015   Mr Koh predicted that every human is smart...
Monster    Jan 22 2015     Ms May, a student, said that John has $10.80. May loves you.
'

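# Turn runs of 2+ spaces between columns into a single "|" separator
txt <- gsub(" {2,}(?=\\S)", "|", txt, perl = TRUE)
# stringsAsFactors = FALSE keeps Content as character (strsplit needs this on R < 4.0)
dataframe <- read.table(sep = "|", text = txt, header = TRUE, stringsAsFactors = FALSE)

# Reporting verbs that mark a sentence as something a person said
verbs <- c("said", "added", "predicted")
pattern <- paste(verbs, collapse = "|")

# Split only on periods followed by a space, so "$10.80" stays whole
content <- strsplit(dataframe$Content, '\\.(?= )', perl = TRUE)
# Keep the sentences that contain a reporting verb
opinions <- lapply(content, grep, pattern = pattern, value = TRUE)
names(opinions) <- dataframe$Title
# Long format: one row per sentence, columns `values` and `ind` (the title)
result <- stack(opinions)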

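In your sample data, every period followed by a space is the end of a sentence, so the regular expression \.(?= ) works. It also leaves decimals such as $10.80 or 0.8% intact, because the period there is followed by a digit rather than a space. However, it will break sentences such as "I was born in the U.S.A. but I live in Canada", so you may need additional preprocessing and checks.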
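
To make the difference concrete, here is a small sketch comparing a split on every period with a split only on periods followed by a space. The sample string is made up for illustration and is not part of the question's data.

```r
# Hypothetical sample text, not taken from the question's data
x <- "His results improved by 0.8%. He said it cost $10.80. Done."

# Splitting on every period cuts the decimals in half
naive <- strsplit(x, ".", fixed = TRUE)[[1]]

# Splitting only on a period followed by a space keeps the decimals whole
careful <- trimws(strsplit(x, "\\.(?= )", perl = TRUE)[[1]])

print(naive)
print(careful)
```

The first element of `naive` is the truncated "His results improved by 0" from the question, while `careful` keeps "His results improved by 0.8%" as one piece.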

Then, assuming Title is a unique identifier, you can simply merge to add the dates back:

# stack() names its columns `values` and `ind`, so join Title against `ind`
result <- merge(dataframe[c("Title", "Date")], result, by.x = "Title", by.y = "ind")
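
The sentences kept so far still contain the reporting clause ("Tom said", "Mr Koh predicted that"). As a rough sketch only, the two shapes that occur in the sample data could be stripped as below. The helper name `extract_opinion` and both regular expressions are my own assumptions for illustration; they are tuned to these four sentences and will not generalise to arbitrary text.

```r
# Hypothetical helper: strips the reporting clause from a sentence.
# Handles only two shapes seen in the sample data:
#   A) "<speaker> <verb> that <opinion>"
#   B) "<opinion>, <speaker> <verb>"
extract_opinion <- function(sentence, verbs = c("said", "added", "predicted")) {
  verb_alt <- paste(verbs, collapse = "|")
  # Shape A: drop everything up to and including "<verb> that "
  after_that <- paste0("^.*\\b(?:", verb_alt, ")\\s+that\\s+")
  # Shape B: drop the trailing ", <speaker> <verb>" (and a final period)
  trailing <- paste0(",[^,]*\\b(?:", verb_alt, ")\\b\\.?\\s*$")
  out <- ifelse(grepl(after_that, sentence, perl = TRUE),
                sub(after_that, "", sentence, perl = TRUE),
                sub(trailing, "", sentence, perl = TRUE))
  # Remove double quotes, surrounding whitespace and trailing periods
  out <- gsub("\"", "", out, fixed = TRUE)
  out <- sub("\\.+$", "", trimws(out))
  tolower(out)
}

sentences <- c("\"She is pretty\", Tom said",
               "The penguin is cute, lion added.",
               "Mr Koh predicted that every human is smart...",
               "Ms May, a student, said that John has $10.80")
print(extract_opinion(sentences))
```

Applied to the sentence column that `stack` produces (named `values`), this would give the lower-case opinions asked for in the question, at least for this sample.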

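As mentioned in the comments, the NLP task itself is more about text parsing than about R programming. You might get some mileage out of looking for patterns like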
<optional adjectives> <noun> <verb> <optional adverbs> <adjective> <optional and/or> <optional adjective> ...

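which would match your sample data, but I am far from an expert here. You would also need a dictionary that gives the lexical category (part of speech) of each word. Googling "extract opinions from text" turns up plenty of useful results on the first page, including this site run by Bing Liu. As far as I know, Professor Liu literally wrote the book on sentiment analysis.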
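
To illustrate the pattern idea above, here is a toy sketch. The lexicon, the tag names and the functions `tag_words` and `looks_like_opinion` are all hypothetical and hand-written for the sample sentences (pronouns are lumped in with nouns for simplicity); a real solution would use a proper part-of-speech tagger or dictionary.

```r
# Toy lexicon: hypothetical, covers only a few words from the sample data
lexicon <- c(the = "DET", every = "DET", she = "NOUN", penguin = "NOUN",
             human = "NOUN", is = "VERB", pretty = "ADJ", cute = "ADJ",
             smart = "ADJ", very = "ADV")

# Map each word of a sentence to its lexical category ("X" if unknown)
tag_words <- function(sentence) {
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  tags <- unname(lexicon[words])
  tags[is.na(tags)] <- "X"
  tags
}

# <optional adjectives> <noun> <verb> <optional adverbs> <adjective>
opinion_shape <- "(ADJ )*NOUN VERB (ADV )*ADJ"

# TRUE when the tag sequence of the sentence contains the opinion shape
looks_like_opinion <- function(sentence) {
  grepl(opinion_shape, paste(tag_words(sentence), collapse = " "))
}

print(looks_like_opinion("The penguin is cute"))
print(looks_like_opinion("John has $10.80"))
```

Matching on the tag sequence rather than on the words themselves is what lets one pattern cover many sentences, but the result is only as good as the dictionary behind it.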