Question

I have some tweets and I want to detect the emoticons in them. For this task I want to use the hash_emoticons dictionary from the textclean package.

hash_emoticons[1:5]
       x                 y
1:   #-) partied all night
2:    %)             drunk
3:   %-)             drunk
4: ',:-l        scepticism
5: ',:-|        scepticism

If I use it with the standard function, I get this error:

library(stringr)

str_detect(Tweets$text, hash_emoticons$x)


longer object length is not a multiple of shorter object length
Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)): 
  Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

Any ideas how to solve this?

Answer 1

Here is one approach that uses the stringi package directly. You will, however, need to think more carefully about some edge cases.

# Generate some data
library("tibble")
xxx <- tibble(Text = c("asdasd", ":o)", "hej :o) :o) :-*"))

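You want to count the number of emoticons used in each string, so every emoticon has to be checked against every string. str_detect() only reports whether a pattern is present, not how many times it occurs, so we use stri_count_fixed() instead.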
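As a side note on the error in the question: str_detect() pairs strings and patterns element by element (hence the length warning) and treats each pattern as a regular expression, so emoticons such as %-) fail with unbalanced parentheses. If presence alone were enough, a minimal sketch with fixed (literal) matching could look like the following. The texts and emoticons vectors are hypothetical stand-ins for Tweets$text and hash_emoticons$x.

```r
library(stringi)

# Hypothetical stand-ins for Tweets$text and hash_emoticons$x
texts <- c("asdasd", ":o)", "hej :o) :o) :-*")
emoticons <- c(":o)", ":-*", "%-)")

# One column per emoticon, one row per text.
# Fixed matching compares literally, so parentheses are not parsed as regex.
hits <- sapply(emoticons, function(e) stri_detect_fixed(texts, e))

# TRUE where a text contains at least one emoticon
has_emoticon <- rowSums(hits) > 0
has_emoticon
```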

For example:

library("stringi")
library("textclean")
# Run through each emoticon
# see if it matches each tweet
# Compute the number of hits
rowSums(sapply(lexicon::hash_emoticons$x, function(i) {
    stringi::stri_count_fixed(xxx$Text, pattern=i)
}))

This returns:

[1] 0 2 5

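Now, if you look at the input strings, you will see 4 emoticons in total. The element :o) matches two dictionary entries, :o and :o), which is why the second element is 2. Similarly, the string hej :o) :o) :-* returns 5 because it matches :o twice, :o) twice, and :-* once.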
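One possible way to avoid this double counting, offered as a sketch rather than a definitive fix, is to combine all emoticons into a single regular expression with the longest entries first, so that :o) is tried before its prefix :o and each position is counted only once. The emoticons vector below is a hypothetical stand-in for lexicon::hash_emoticons$x, and the \Q...\E quoting assumes that no emoticon contains the literal sequence \E.

```r
library(stringi)

# Hypothetical stand-in for lexicon::hash_emoticons$x (a small subset)
emoticons <- c(":o", ":o)", ":-*")
texts <- c("asdasd", ":o)", "hej :o) :o) :-*")

# Longest emoticons first, so ":o)" is tried before its prefix ":o"
ordered <- emoticons[order(nchar(emoticons), decreasing = TRUE)]

# Quote each emoticon literally with \Q...\E and join the alternatives with "|"
pattern <- paste0("\\Q", ordered, "\\E", collapse = "|")

# Count non-overlapping matches per text
counts <- stri_count_regex(texts, pattern)
counts
```

With this sample data the counts are 0, 1 and 3, which adds up to the 4 emoticons visible in the input.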
