Match columns across two different data frames and subtract the corresponding values

Asked: 2019-08-22 14:28:29

Tags: r

I have a data frame of words (the tweets have been tokenized), the number of times each word is used, its attached sentiment score, and the total score (n * value). I created another data frame of all the words in my corpus that follow a negation word (so I built bigrams and filtered for rows where word_1 is a negation word).

I want to subtract the negated counts from the original data frame, so that it shows the net occurrences of each word.

library(tidyverse)
library(tidyr)
library(tidytext)
tweets <- read_csv("http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv")

custom_stop_words <- bind_rows(tibble(word = c("https", "t.co", "rt", "amp"), 
      lexicon = c("custom")), stop_words)


tweet_tokens <- tweets %>% 
  select(user_id, user_key, text, created_str) %>% 
  na.omit() %>% 
  mutate(row= row_number()) %>% 
  unnest_tokens(word, text, token = "tweets") %>% 
  filter(!word %in% custom_stop_words$word)

sentiment <- tweet_tokens %>% 
  count(word, sort = T) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  mutate(total_score = n * value)
#df showing contribution of overall sentiment to each word

negation_words <- c("not", "no", "never", "without", "won't", "dont", "doesnt", "doesn't", "don't", "can't") 

bigrams <- tweets %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) #re-tokenise our tweets with bigrams. 

bigrams_separated <- bigrams %>% 
  separate(bigram, c("word_1", "word_2"), sep = " ")

not_words <- bigrams_separated %>%
  filter(word_1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
  count(word_2, value, sort = TRUE) %>% 
  mutate(value = value * -1) %>% 
  mutate(contribution = value * n)

I want the result to be a data frame. So if sentiment shows 'matter' appearing 696 times, but the not_words df shows it was preceded by a negation 274 times, the new data frame would show an n value of 422 for 'matter'.
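The desired net-count computation can be sketched with toy stand-ins for the two data frames (the counts below are hypothetical, mirroring the 'matter' example):

```r
library(dplyr)

# Toy stand-ins for the real data frames
totals  <- tibble(word = c("matter", "good"), n = c(696, 150))  # like 'sentiment'
negated <- tibble(word_2 = "matter", n = 274)                   # like 'not_words'

# Subtract the negated count where a match exists; words that were
# never negated get NA from the join, which coalesce() turns into 0
net <- totals %>%
  left_join(negated, by = c("word" = "word_2"), suffix = c("", "_neg")) %>%
  mutate(n_net = n - coalesce(n_neg, 0))
net
```

Here 'matter' nets out to 696 - 274 = 422, while 'good' keeps its 150.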

1 Answer:

Answer 0 (score: 0)

(Without really knowing the specifics) I think you did a good job aggregating your tweet_tokens and not_words data sets. You do have to modify them slightly, though, to make this work the way you (probably?) want.

  1. Disable the mutate(row = ... line in your tweet_tokens <- ... data frame; it will give you trouble later if you don't. Re-run your sentiment <- ... data frame afterwards, just to be safe.
tweet_tokens <- tweets %>% 
   select(user_id, user_key, text, created_str) %>% 
   na.omit() %>% 
   #mutate(row= row_number()) %>% 
   unnest_tokens(word, text, token = "tweets") %>% 
   filter(!word %in% custom_stop_words$word)
  2. Cut the last three lines of your not_words <- ... data frame, since the count(... summary would later keep you from referring back to the individual tweets. The line select(user_id, user_key, created_str, word = word_2) gives you a data frame with the same "standard" as the tweet_tokens data frame. Note also how the 'word_2' column is now called 'word' (in the new not_words data frame).
not_words <- bigrams_separated %>%
   filter(word_1 %in% negation_words) %>%
   inner_join(get_sentiments("afinn"), by = c(word_2 = "word")) %>%
   select(user_id,user_key,created_str,word = word_2)

Now, for your specific example/case: when subsetting for the word 'matter' (in tweet_tokens), we indeed get a data frame of 696 rows...

> matter_tweet = tweet_tokens[tweet_tokens$word=='matter',]
> dim(matter_tweet)
[1] 696   4

...and when subsetting for the word 'matter' (in not_words), we end up with a data frame of 274 rows.

> matter_not = not_words[not_words$word=='matter',]
> dim(matter_not)
[1] 274   4

So if we just subtracted matter_not from matter_tweet, you'd expect to find those 422 rows.
Hmm... not so fast... strictly speaking, I'm fairly sure that's not what you really want either.

  • The short and accurate answer is:
> anti_join(matter_tweet,matter_not)
Joining, by = c("user_id", "user_key", "created_str", "word")
# A tibble: 429 x 4
      user_id user_key       created_str         word  
        <dbl> <chr>          <dttm>              <chr> 
 1 1671234620 hyddrox        2016-10-17 07:22:47 matter
 2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
 3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
 4 1680366068 willisbonnerr  2017-02-14 09:14:24 matter
 5 2533221819 lazykstafford  2015-12-25 13:41:12 matter
 6 1833223908 dorothiebell   2016-09-29 21:08:14 matter
 7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
 8 2606301939 finley1589     2016-09-19 08:24:37 matter
 9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 419 more rows
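A toy example (made-up rows, not the real tweets) shows the anti_join semantics relied on here: every x row that has a match in y is dropped, but the surviving rows are not deduplicated:

```r
library(dplyr)

x <- tibble(id = c(1, 1, 2, 3), word = "matter")  # id 1 appears twice
y <- tibble(id = 2, word = "matter")

res <- anti_join(x, y, by = c("id", "word"))
res  # three rows: id 1 (kept twice) and id 3; only id 2 is removed
```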
  • Now let me explain why you get 429 rows when you expected 422.
> #-not taking into account NAs in the 'user_id column' (you'll decide what to do with that issue later, I guess)
> matter_not_clean = matter_not[!is.na(matter_not$user_id),]
> dim(matter_not_clean)
[1] 256   4
> #-the above dataframe also contains duplicates, which we 'have to?' get rid of
> #-the 'matter' dataframe is the cleanest you can have
> matter = matter_not_clean[!duplicated(matter_not_clean),]
> dim(matter)
[1] 250   4

#-you'd be tempted to say that 696-250=446 are the rows you'd want now;
#-...which is not true as some of the 250 rows from 'matter' are also duplicated in
#-...'matter_tweet', but that should not worry you. You can later delete them... if that's what you want.
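The duplicated() filter used above can be sketched on toy rows (hypothetical data): it keeps only the first copy of each repeated row.

```r
d <- data.frame(id = c(1, 1, 2), word = "matter")

d_unique <- d[!duplicated(d), ]  # first copy of each repeated row survives
d_unique                         # two rows remain: id 1 and id 2
```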

> #-then I jump to 'data.table' as it helps me to prove my point
> library(data.table)
> #-transforming those 'tbl_df' into 'data.table'
> mt = as.data.table(matter_tweet)
> mm = as.data.table(matter)

> #-I check if (all) 'mm' is contained in 'mt'
> test = mt[mm,on=names(mt)]
> dim(test)
[1] 267   4

Those 267 rows are the ones you want to get rid of! Hence, the data frame you're looking for has 696-267 = 429 rows!
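The inflation from 250 lookup rows to 267 result rows comes from data.table's X[Y, on = ...] join returning one row per match; a toy example (made-up ids):

```r
library(data.table)

mt_toy <- data.table(id = c(1, 1, 2))  # id 1 duplicated, as in 'matter_tweet'
mm_toy <- data.table(id = c(1, 2))     # deduplicated, as in 'matter'

joined <- mt_toy[mm_toy, on = "id"]
joined  # three rows: id 1 matches twice, so 2 lookup rows yield 3 result rows
```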

> #-the above implies that there are indeed duplicates... but this doesn't mean that all 'mm' is contained in 'mt'
> #-now I remove the duplicates
> test[!duplicated(test),]
          user_id       user_key         created_str   word
  1: 1.518857e+09   nojonathonno 2016-11-08 10:36:14 matter
  2: 1.594887e+09 jery_robertsyo 2016-11-08 20:57:07 matter
  3: 1.617939e+09      paulinett 2017-01-14 16:33:38 matter
  4: 1.617939e+09      paulinett 2017-03-05 18:16:48 matter
  5: 1.617939e+09      paulinett 2017-04-03 03:21:34 matter
 ---                                                       
246: 4.508631e+09 thefoundingson 2017-03-23 13:40:00 matter
247: 4.508631e+09 thefoundingson 2017-03-29 01:05:01 matter
248: 4.840552e+09    blacktolive 2016-07-19 15:32:04 matter
249: 4.859142e+09  trayneshacole 2016-04-09 23:16:13 matter
250: 7.532149e+17  margarethkurz 2017-03-05 16:31:43 matter
> #-and here I test that all 'matter' is in 'matter_tweet', which IT IS!
> identical(mm,test[!duplicated(test),])
[1] TRUE

> #-in this way we keep the duplicates from/in 'matter_tweet' 
> answer = mt[!mm,on=names(mt)]
> dim(answer)
[1] 429   4
> #-if we remove the duplicates we end up with a dataframe of 415 rows
> #-...and this is where I am not sure if that's what you want
> answer[!duplicated(answer),]
        user_id        user_key         created_str   word
  1: 1671234620         hyddrox 2016-10-17 07:22:47 matter
  2: 1623180199  jeffreykahunas 2016-09-14 12:53:37 matter
  3: 1594887416  jery_robertsyo 2016-10-21 14:24:05 matter
  4: 1680366068   willisbonnerr 2017-02-14 09:14:24 matter
  5: 2533221819   lazykstafford 2015-12-25 13:41:12 matter
 ---                                                      
411: 4508630900  thefoundingson 2016-09-13 12:15:03 matter
412: 1655194147   melanymelanin 2016-02-21 02:32:50 matter
413: 1684524144    datwisenigga 2017-04-27 02:45:25 matter
414: 1660771422 garrettsimpson_ 2016-10-14 01:14:04 matter
415: 1671234620         hyddrox 2017-02-19 19:40:39 matter

> #-you'll get this same 'answer' if you do:
> setdiff(matter_tweet,matter)
# A tibble: 415 x 4
      user_id user_key       created_str         word  
        <dbl> <chr>          <dttm>              <chr> 
 1 1671234620 hyddrox        2016-10-17 07:22:47 matter
 2 1623180199 jeffreykahunas 2016-09-14 12:53:37 matter
 3 1594887416 jery_robertsyo 2016-10-21 14:24:05 matter
 4 1680366068 willisbonnerr  2017-02-14 09:14:24 matter
 5 2533221819 lazykstafford  2015-12-25 13:41:12 matter
 6 1833223908 dorothiebell   2016-09-29 21:08:14 matter
 7 2587100717 judelambertusa 2014-12-13 14:41:08 matter
 8 2606301939 finley1589     2016-09-19 08:24:37 matter
 9 4272870988 pamela_moore13 2016-08-03 18:21:01 matter
10 2531159968 traceyhappymom 2017-01-14 12:07:55 matter
# … with 405 more rows
> #-but now you know why ;)
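The difference between the two results (429 vs. 415 rows) boils down to dplyr::setdiff() being a true set operation that also drops x's own duplicates, while anti_join() keeps them; a toy example (made-up ids):

```r
library(dplyr)

x <- tibble(id = c(1, 1, 2, 3))  # id 1 duplicated
y <- tibble(id = 2)

nrow(anti_join(x, y, by = "id"))  # 3: the duplicate in x survives
nrow(setdiff(x, y))               # 2: set semantics deduplicate x as well
```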

> #-testing equality in both methods
> identical(answer[1:429,],as.data.table(anti_join(matter_tweet,matter_not))[1:429,])
Joining, by = c("user_id", "user_key", "created_str", "word")
[1] TRUE

Conclusion 1: do anti_join(matter_tweet, matter) if you want to keep the duplicated rows from your tweet_tokens data frame; do setdiff(matter_tweet, matter) if you don't.

Conclusion 2: notice that anti_join(matter_tweet, matter_not) and anti_join(matter_tweet, matter) give the same answer. This means that anti_join(... did not take the NAs into account in its operation.
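Conclusion 2 can be checked on toy data (made-up ids). One hedge worth noting: by default dplyr joins do treat NA keys as matching each other, so the NA rows in matter_not are inert here only because tweet_tokens went through na.omit() and therefore x contains no NA keys for them to match:

```r
library(dplyr)

x         <- tibble(id = c(1, 2))   # no NAs, like tweet_tokens after na.omit()
y_with_na <- tibble(id = c(2, NA))
y_clean   <- tibble(id = 2)

# TRUE: the NA row in y removes nothing, since x has no NA keys
identical(anti_join(x, y_with_na, by = "id"),
          anti_join(x, y_clean,   by = "id"))
```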