Question

我一直在玩R的情绪分析功能，并且在运行gsub函数时会遇到错误。正面和负面的单词列表来自here。

经过一些谷歌搜索，我在R帮助列表中发现了一个这个错误，但没有别的。有没有人遇到这个问题？到底是怎么回事？有解决方法吗？

我在过去处理字符串时运行了类似的代码（使用gsub和stringer包），这是我第一次遇到这种类型的错误。此外，我尝试通过在不同的字符串集上编写类似的脚本来重现此错误，并且工作正常。

以下是错误消息：

> pos_match <- str_c(vpos, collapse = "|")
> neg_match <- str_c(vneg, collapse = "|")
> dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression, reason 'Out of memory'

这是整个'过程'。

## SET WORKING DIRECTOR AND IMPORT PACKAGES:
setwd("~/Desktop/R_Tricks")
require(tm); require(stringr); require(lubridate); library(RTextTools)

# IMPORT DATA:
d1 <- read.csv("Video_Comments.csv", stringsAsFactors=FALSE, sep=",", fileEncoding="ISO_8859-2")
pos <- read.csv("positive-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
neg <- read.csv("negative-words.csv", stringsAsFactors=FALSE, header=TRUE, fileEncoding="ISO_8859-2")
vpos = as.vector(pos[,1]); vneg = as.vector(neg[,1])
head(vpos); head(vneg)
colnames(d1); nrow(d1); ncol(d1)
str(d1); head(d1)
table(d1$Likes); table(d1$Replies)
nrow(vpos); nrow(vneg)
length(vpos); length(vneg)
is.atomic(vpos); is.atomic(vneg)

# SELECT DATA:
dat = data.frame(Comment=c(d1$Comment))
head(dat)
# CLEAN DATA - COMMENTS:
dat$Comment = gsub('[[:punct:]]', '', dat$Comment)
dat$Comment = gsub('[[:cntrl:]]', '', dat$Comment)
dat$Comment = gsub('\\d+', '', dat$Comment)
dat$Comment = tolower(dat$Comment)
head(dat)
# CLEAN DATA - CLASSIFICATIONS:
vpos = gsub('[[:punct:]]', '', vpos); vneg = gsub('[[:punct:]]', '', vneg)
vpos = gsub('[[:cntrl:]]', '', vpos); vneg = gsub('[[:cntrl:]]', '', vneg)
vpos = gsub('\\d+', '', vpos); vneg = gsub('\\d+', '', vneg)
vpos = tolower(vpos); vneg = tolower(vneg)
head(vpos); head(vneg)

# MATCH WORDS WITH FACEBOOK COMMENTS:
pos_match <- str_c(vpos, collapse = "|")
neg_match <- str_c(vneg, collapse = "|")
dat$positive <- as.numeric(str_detect(dat$Comment, pos_match))
dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))

编辑：

我收到的另一条错误消息如下：

> dat$negative <- as.numeric(str_detect(dat$Comment, neg_match))
Error: invalid regular expression 'faced|faces|abnormal|abolish|abominable|abominably|abominate|abomination|abort|aborted|

编辑2：

重现错误的数据：

dat = c("Hey guys I am Aliza Lomez...18 y.o. I need your likes please like my page and find love quotes, beauty tips and much more.Please like my page you will never regret thank u all\u0083 <3 <3 <3...",
        "Alexandra Saturn", "And that's what makes a Subaru a Subaru", "Missouri in a battleground....; meanwhile in southern California....", "What the Frisbee", "very cool !!!!", "Get a life", 
        "Try that with my GT!!!", "Did he make any money?", "Wo! WO! BSMITH THROWING DISCS WITH SUBARUS?!?! THIS IS SO AWESOME! SHOULD OF USED AN STI THO")

Answer 1

我不知道整个解决方案，但我可以帮助您入门。我做了这个社区维基，希望有人可以填补空白......

对于无效的正则表达式，要创建OR，您需要将所有内容括在括号中。例如，如果要匹配单词“a”，“an”或“the”，则可以使用正则表达式字符串(a|an|the)。如果我有一个单词列表，我想与正则表达式中的OR匹配，这是我通常使用的：

mywords <- c("a", "an", "the")
mystring <- paste0("(", paste(mywords, collapse="|"), ")")

> mystring
[1] "(a|an|the)"

这应该可以消除无效的正则表达式错误，因为你的字符串不是以一个开括号开头，而是以管道而不是一个右括号结束。

正则表达式错误消息 - “内存不足”

1 个答案: