I have data like this:
library(dplyr)
glimpse(samp)
Observations: 10
Variables: 2
$ text <chr> "@VirginAmerica What @dhepburn said.", "@VirginAmerica plus you've ...
$ airline_sentiment <chr> "neutral", "positive", "neutral", "negative", "negative", "negative...
I want to compare the occurrence of words in the text variable against the words in a lexicon, i.e. I want to count, based on the lexicon, how often a given word occurs in the text.
The lexicon looks like this:
library(lexicon)
hash_sentiment_sentiword[1:5]
x y
1: 365 days -0.50
2: 366 days 0.25
3: 3tc -0.25
4: a fortiori 0.25
5: a good deal 0.25
I know there are functions like str_detect. But with those I would only get TRUE/FALSE values.
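For example (just an illustration with a made-up string):
library(stringr)
# str_detect() only reports whether the pattern occurs, not how often
str_detect("What an awesome, awesome flight", "awesome")
# [1] TRUE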
The result should look like this (pseudocode):
   text     x        y      n
1. word 1   word 1   score  2
2. word 2   word 2   score  1
3. word 3   word 3   score 10
4. word 4   word 4   score  0
5. word 5   word 5   score  0
...
text: a word from the text column of samp;
x and y: the x and y columns of hash_sentiment_sentiword;
n: how often the word x occurs in the text. For example, the word "awesome" is in x and occurs once in the text, so n is 1 for "awesome"; "country" is not in x but occurs in the text, so its n is 0.
Here is a small dput():
dput(samp)
structure(list(text = c("@VirginAmerica Thanks!", "@VirginAmerica SFO-PDX schedule is still MIA.",
"@VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo",
"@VirginAmerica I flew from NYC to SFO last week and couldn't fully sit in my seat due to two large gentleman on either side of me. HELP!",
"I <U+2764><U+FE0F> flying @VirginAmerica. <U+263A><U+FE0F><U+0001F44D>",
"@VirginAmerica you know what would be amazingly awesome? BOS-FLL PLEASE!!!!!!! I want to fly with only you."
), airline_sentiment = c("positive", "negative", "positive",
"negative", "positive", "positive")), row.names = 15:20, class = "data.frame")
Answer 0 (score: 1)
One way of doing this, and there are as many ways as there are text-mining packages, is to use tidytext. I chose tidytext because you are already using dplyr and tidytext plays nicely with it. I'm using an inner_join to join the lexicon with your data; change this to a left_join if you want to keep the words that have no match in the lexicon (see the sketch after the output below).
library(tidytext)
library(dplyr)
samp %>%
  # split each tweet into one word per row
  unnest_tokens(text, output = "words", token = "tweets") %>%
  # keep only the words that have a match in the lexicon
  inner_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>%
  # count how often each word/score pair occurs
  group_by(words, y) %>%
  summarise(n = n())
# A tibble: 20 x 3
# Groups: words [?]
words y n
<chr> <dbl> <int>
1 about 0.25 1
2 amazingly 0.125 1
3 cross -0.75 1
4 due 0.25 1
5 excited 0 1
6 first 0.375 1
7 fly -0.5 1
8 fully 0.375 1
9 help 0.208 1
10 know 0.188 1
11 large -0.25 1
12 last -0.208 1
13 lax -0.375 1
14 on 0.125 1
15 please 0.125 1
16 side -0.125 1
17 still -0.107 1
18 thanks 0 1
19 virgin 0.25 1
20 want 0.125 1
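As mentioned above, swapping inner_join for left_join keeps the tweet words that have no lexicon match (their y becomes NA). A minimal sketch of that variant, mirroring the code above:
library(tidytext)
library(dplyr)
samp %>%
  unnest_tokens(text, output = "words", token = "tweets") %>%
  # left_join keeps every token; words without a lexicon entry get y = NA
  left_join(lexicon::hash_sentiment_sentiword, by = c("words" = "x")) %>%
  group_by(words, y) %>%
  summarise(n = n())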
Extra info for tidytext: Tidy Text Mining with R
CRAN Task View: Natural Language Processing
Other packages: quanteda, qdap, sentimentr, udpipe
Answer 1 (score: 1)
Here is a base R solution:
# create an array of all the words in samp$text
# optional: use regex to remove punctuation symbols (this can be refined)
textWords <- unlist(strsplit(gsub('[[:punct:]]','',samp$text,perl=TRUE), ' '))
# count occurrences of each word and store them as a data frame
occurences <- unique(data.frame(text = textWords,
n = as.integer(ave(textWords, textWords, FUN = length)),
stringsAsFactors = FALSE))
# get words of x with scores y
xWordsList <- setNames(strsplit(lexicon::hash_sentiment_sentiword$x, ' '),
lexicon::hash_sentiment_sentiword$y)
# create the result data frame
res <- data.frame(x = unlist(xWordsList), y = rep(names(xWordsList), lengths(xWordsList)))
rm(xWordsList) # removing as object is rather large and not needed anymore
# subset to keep only x elements which are in text
res <- res[res$x %in% textWords,]
# match occurrences
res$n <- vapply(1:nrow(res),
function (k) occurences$n[occurences$text == res$x[k]],
integer(1))
rownames(res) <- 1:nrow(res)
# a glimpse at the result
head(res)
# x y n
# 1 great 0.3125 1
# 2 in -0.125 1
# 3 about 0.25 1
# 4 of 0.125 1
# 5 of -0.125 1
# 6 to 0.125 4
This can be enhanced here and there (e.g. by using .subset2 or by refining the regex). Also note that I omitted the text column from res, since by definition it is identical to the x column.
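For instance, one possible refinement (my own sketch, not part of the answer above) is to replace the row-wise vapply() lookup with a single vectorised match(), since occurences holds exactly one row per distinct word:
# vectorised lookup: find each res$x in occurences$text and take its count
# (produces the same n column as the vapply() call above)
res$n <- occurences$n[match(res$x, occurences$text)]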