Question

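I am working with a set of tweets pulled from Twitter's public API and am trying to do some text analysis on them.

Currently, I have a data frame of tweets, with the text in a column named total.tweets$text.

I also have a sentiment lexicon (i.e. positive, negative, etc.) and pull each column of the CSV file out as a string: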

posTerms <- toString(na.omit(lexicon$Positiv))

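I want to count how many times any of the positive words in this file appear in each tweet, and store that count in a new column, total.tweets$PosCount.

For example, take the tweet: Our greatest glory is not in never falling, but in rising every time we fall #confucius #entrepreneurship

If greatest, glory, and rising are among the positive terms, PosCount would be 3.

I tried using str_count, as follows: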

posTerms <- toString(na.omit(lexicon$Positiv))

total.tweets$Positiv <- str_count(total.tweets$text, paste(posTerms, collapse = '|'))

But I keep getting this error:

Error: invalid regular expression 'ABIDE, ABILITY, ABLE, ABOUND, ...

Any ideas would be greatly appreciated!

Answer 1

It looks like you can use the %in% operator:

lexicon = c("greatest" ,"glory" , "rising")
sentence = "Our greatest glory is not in never falling, but in rising every time we fall #confucius #entrepreneurship"

Use strsplit to split the sentence into words:

words = strsplit(sentence, " ")

sum(words[[1]] %in% lexicon) # returns the number of words that match the lexicon
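
To get one count per tweet, as the question asks, the same idea could be applied to the whole column. The following is only a sketch: the small `lexicon` vector and the two-row `total.tweets` data frame are hypothetical stand-ins for the asker's data, and it assumes matching should ignore case and punctuation (splitting on a single space would leave tokens such as "falling," with the comma attached).

```r
# Hypothetical stand-ins for the asker's lexicon and data frame.
lexicon <- c("greatest", "glory", "rising")
total.tweets <- data.frame(
  text = c(
    "Our greatest glory is not in never falling, but in rising every time we fall #confucius #entrepreneurship",
    "Nothing to see here"
  ),
  stringsAsFactors = FALSE
)

# Lowercase each tweet, then split on runs of non-alphanumeric characters
# so that punctuation and "#" do not end up inside the tokens.
words <- strsplit(tolower(total.tweets$text), "[^[:alnum:]]+")

# For each tweet, count the tokens that appear in the (lowercased) lexicon.
total.tweets$PosCount <- vapply(
  words,
  function(w) sum(w %in% tolower(lexicon)),
  integer(1)
)

total.tweets$PosCount
```

Note that this counts every matching token, so a lexicon word that appears twice in a tweet contributes 2.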

Answer 2

Using the stringi package:

text <- "Our greatest glory is not in never falling, but in rising every time we fall #confucius #entrepreneurship"
search <- c("greatest", "glory","rising")
require(stringi)
stri_detect_fixed(text, search)
## [1] TRUE TRUE TRUE
sum(stri_detect_fixed(text, search))
## [1] 3
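
Note that stri_detect_fixed only reports whether each term occurs at least once in a single text, and it also matches inside longer words. If a per-tweet count of whole-word occurrences is wanted, one possible variation is to build a single regular expression from the terms and use stri_count_regex. This is a sketch under some assumptions: the lexicon is kept as a character vector (not collapsed into one comma-separated string with toString), the terms contain no regex metacharacters, and matching should be case-insensitive. The two-element `text` vector is made up for illustration.

```r
library(stringi)

# Hypothetical example data: two "tweets" and a small lexicon vector.
text <- c(
  "Our greatest glory is not in never falling, but in rising every time we fall #confucius #entrepreneurship",
  "Rising and rising again"
)
search <- c("greatest", "glory", "rising")

# Join the terms with "|" and wrap them in word boundaries so that only
# whole words match. Assumes the terms contain no regex metacharacters.
pattern <- paste0("\\b(", paste(search, collapse = "|"), ")\\b")

# Count the matches in each element of text, ignoring case.
counts <- stri_count_regex(
  text,
  pattern,
  opts_regex = stri_opts_regex(case_insensitive = TRUE)
)

counts
```

The result has one count per element of `text`, so it can be assigned directly to a column such as total.tweets$PosCount.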

Counting occurrences of words from a list in text entries (R)
