I have a data frame of words (the tweets have already been tokenised), the number of times each word is used, the sentiment score attached to it, and a total score (n * value). I've created another data frame of all the words in my corpus that follow a negation word (so I made bigrams and filtered for word_1 being a negation word).
I want to subtract the negated occurrences from the original data frame so that it shows the net count for each word.
library(tidyverse)
library(tidyr)
library(tidytext)
tweets <- read_csv("http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv")
custom_stop_words <- bind_rows(tibble(word = c("https", "t.co", "rt", "amp"),
lexicon = c("custom")), stop_words)
tweet_tokens <- tweets %>%
select(user_id, user_key, text, created_str) %>%
na.omit() %>%
mutate(row= row_number()) %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word)
sentiment <- tweet_tokens %>%
count(word, sort = T) %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(total_score = n * value)
#df showing each word's contribution to overall sentiment
negation_words <- c("not", "no", "never", "without", "won't", "dont", "doesnt", "doesn't", "don't", "can't")
bigrams <- tweets %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) #re-tokenise our tweets with bigrams.
bigrams_separated <- bigrams %>%
separate(bigram, c("word_1", "word_2"), sep = " ")
not_words <- bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
count(word_2, value, sort = TRUE) %>%
mutate(value = value * -1) %>%
mutate(contribution = value * n)
I'd like the result to be a data frame. So if sentiment shows 'matter' appearing 696 times, but the not_words df shows it preceded by a negation 274 times, the new data frame would show an n value of 422 for 'matter'.
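Something like this is the shape I'm after (a sketch only; n_negated and net_n are illustrative names I'm making up, and replace_na comes from tidyr):
# Sketch: join the negated counts onto the overall counts and subtract.
net <- sentiment %>%
  left_join(not_words %>% select(word = word_2, n_negated = n),
            by = "word") %>%
  mutate(n_negated = replace_na(n_negated, 0),  # words never negated get 0
         net_n = n - n_negated)                 # e.g. matter: 696 - 274 = 422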
Answer 0 (score: 0):
(Without really knowing the specifics) I think you did a good job summarising the tweet_tokens and not_words datasets. You will, however, have to modify them slightly to make them work the way you (presumably?) want.
Comment out the mutate(row = ... line in your tweet_tokens <- ... data frame; it will give you trouble otherwise. To be safe, re-run your sentiment <- ... data frame afterwards.
tweet_tokens <- tweets %>%
select(user_id, user_key, text, created_str) %>%
na.omit() %>%
#mutate(row= row_number()) %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word)
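(The re-run is just the same sentiment pipeline from the question, now fed by the corrected tweet_tokens:)
# Re-run on the row-free tweet_tokens; identical to the pipeline above.
sentiment <- tweet_tokens %>%
  count(word, sort = T) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(total_score = n * value)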
Then drop the last three lines of your not_words <- ... data frame: summarising with count(... there won't let you refer back to the underlying data frame later. The select(user_id, user_key, created_str, word = word_2) line gives you a data frame with the same "standard" as the tweet_tokens data frame. Also notice how the "word_2" column is now called "word" (in the new not_words data frame).
not_words <- bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
select(user_id,user_key,created_str,word = word_2)
Now, for your particular example/case: when we filter tweet_tokens for the word "matter", we do get a data frame with 696 rows...
> matter_tweet = tweet_tokens[tweet_tokens$word=='matter',]
> dim(matter_tweet)
[1] 696 4
...and when we filter not_words for the word "matter", we end up with a data frame with 274 rows.
> matter_not = not_words[not_words$word=='matter',]
> dim(matter_not)
[1] 274 4
So if we simply subtract matter_not from matter_tweet, you should find those 422 rows, right?
Hmm... not so fast... strictly speaking, I'm fairly sure that's not what you really want either.
> anti_join(matter_tweet,matter_not)
Joining, by = c("user_id", "user_key", "created_str", "word")
# A tibble: 429 x 4
user_id user_key created_str word
<dbl> <chr> <dttm> <chr>
1 1671234620 hyddrox 2016-10-17 07:22:47 matter
2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5 2533221819 lazykstafford 2015-12-25 13:41:12 matter
6 1833223908 dorothiebell 2016-09-29 21:08:14 matter
7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
8 2606301939 finley1589 2016-09-19 08:24:37 matter
9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 419 more rows
> #-not taking into account NAs in the 'user_id' column (you'll decide what to do with that issue later, I guess)
> matter_not_clean = matter_not[!is.na(matter_not$user_id),]
> dim(matter_not_clean)
[1] 256 4
> #-the above dataframe also contains duplicates, which we 'have to?' get rid of
> #-the 'matter' dataframe is the cleanest you can have
> matter = matter_not_clean[!duplicated(matter_not_clean),]
> dim(matter)
[1] 250 4
#-you'd be tempted to say that 696-250=446 are the rows you'd want now;
#-...which is not true as some of the 250 rows from 'matter' are also duplicated in
#-...'matter_tweet', but that should not worry you. You can later delete them... if that's what you want.
> #-then I jump to 'data.table' as it helps me to prove my point
> library(data.table)
> #-transforming those 'tbl_df' into 'data.table'
> mt = as.data.table(matter_tweet)
> mm = as.data.table(matter)
> #-I check if (all) 'mm' is contained in 'mt'
> test = mt[mm,on=names(mt)]
> dim(test)
[1] 267 4
Those 267 rows are the ones you want to get rid of! So the data frame you're looking for has 696 - 267 = 429 rows!
> #-the above implies that there are indeed duplicates... but this doesn't mean that all of 'mm' is contained in 'mt'
> #-now I remove the duplicates
> test[!duplicated(test),]
user_id user_key created_str word
1: 1.518857e+09 nojonathonno 2016-11-08 10:36:14 matter
2: 1.594887e+09 jery_robertsyo 2016-11-08 20:57:07 matter
3: 1.617939e+09 paulinett 2017-01-14 16:33:38 matter
4: 1.617939e+09 paulinett 2017-03-05 18:16:48 matter
5: 1.617939e+09 paulinett 2017-04-03 03:21:34 matter
---
246: 4.508631e+09 thefoundingson 2017-03-23 13:40:00 matter
247: 4.508631e+09 thefoundingson 2017-03-29 01:05:01 matter
248: 4.840552e+09 blacktolive 2016-07-19 15:32:04 matter
249: 4.859142e+09 trayneshacole 2016-04-09 23:16:13 matter
250: 7.532149e+17 margarethkurz 2017-03-05 16:31:43 matter
> #-and here I test that all 'matter' is in 'matter_tweet', which IT IS!
> identical(mm,test[!duplicated(test),])
[1] TRUE
> #-in this way we keep the duplicates from/in 'matter_tweet'
> answer = mt[!mm,on=names(mt)]
> dim(answer)
[1] 429 4
> #-if we remove the duplicates we end up with a dataframe of 415 rows
> #-...and this is where I am not sure if that's what you want
> answer[!duplicated(answer),]
user_id user_key created_str word
1: 1671234620 hyddrox 2016-10-17 07:22:47 matter
2: 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3: 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4: 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5: 2533221819 lazykstafford 2015-12-25 13:41:12 matter
---
411: 4508630900 thefoundingson 2016-09-13 12:15:03 matter
412: 1655194147 melanymelanin 2016-02-21 02:32:50 matter
413: 1684524144 datwisenigga 2017-04-27 02:45:25 matter
414: 1660771422 garrettsimpson_ 2016-10-14 01:14:04 matter
415: 1671234620 hyddrox 2017-02-19 19:40:39 matter
> #-you'll get this same 'answer' if you do:
> setdiff(matter_tweet,matter)
# A tibble: 415 x 4
user_id user_key created_str word
<dbl> <chr> <dttm> <chr>
1 1671234620 hyddrox 2016-10-17 07:22:47 matter
2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
4 1680366068 willisbonnerr 2017-02-14 09:14:24 matter
5 2533221819 lazykstafford 2015-12-25 13:41:12 matter
6 1833223908 dorothiebell 2016-09-29 21:08:14 matter
7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
8 2606301939 finley1589 2016-09-19 08:24:37 matter
9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 405 more rows
> #-but now you know why ;)
> #-testing equality in both methods
> identical(answer[1:429,],as.data.table(anti_join(matter_tweet,matter_not))[1:429,])
Joining, by = c("user_id", "user_key", "created_str", "word")
[1] TRUE
Conclusion 1: if you want to keep the duplicated values from the tweet_tokens data frame, do anti_join(matter_tweet, matter); if you don't, do setdiff(matter_tweet, matter).
Conclusion 2: notice that anti_join(matter_tweet, matter_not) and anti_join(matter_tweet, matter) give the same answer. This means that anti_join(... does not take NAs into account in its operation.
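And if what you ultimately need is the net count per word (matter = 429 here) rather than the row-level tokens, here is a hedged sketch of the generalisation, assuming the cleaned not_words from above (net_tokens and net_sentiment are names of my choosing):
# Sketch: apply the anti-join across all words at once, then collapse
# back to net counts per word.
net_tokens <- anti_join(tweet_tokens, not_words,
                        by = c("user_id", "user_key", "created_str", "word"))
net_sentiment <- net_tokens %>%
  count(word, sort = TRUE) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(total_score = n * value)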