How to count how many times the elements of one data frame appear in another data frame

Time: 2019-08-02 19:54:24

Tags: r stringr

I have 2 data frames. One is a list of vendors:

      vendor
1     apple
2     samsung
3     whirlpool
etc
.
.
.

The other contains articles about specific vendors:

nbr     title     content
1       title 1   This is an article about apple
2       title 2   This is an article about whirlpool
3       title 3   This is an article about samsung
4       title 4   This is an article about apple and samsung
5       title 5   This is an article about none of them
etc
.
.
.

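I have tried many functions from the stringr package, but I don't want to count just one term; I want to count the whole vendor list. I also tried grouping and counting with dplyr, but I couldn't get that to work the way I want either.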

In the end, I want two outputs. The first is the number of times each vendor is mentioned across all articles:

apple       2
samsung     2
whirlpool   1
etc.
.
.
.

I would also like to see how many times each vendor is mentioned in each article:

title     apple     samsung     whirlpool    etc...
title 1       1
title 2                                 1
title 3                   1
title 4       1           1
title 5
etc.
.
.
.

3 answers:

Answer 0 (score: 4)

Here is one solution:

mentions = stringr::str_extract_all(art$content, pattern = paste(v$vendor, collapse = "|"))
table(unlist(lapply(mentions, unique)))
# apple   samsung whirlpool 
#     2         2         1 

mentions = lapply(mentions, factor, levels = v$vendor)
names(mentions) = art$title  # label the rows by article title
t(sapply(mentions, table))
#         apple samsung whirlpool
# title 1     1       0         0
# title 2     0       0         1
# title 3     0       1         0
# title 4     1       1         0
# title 5     0       0         0

Using this data:

v = read.table(text = "      vendor
1     apple
2     samsung
3     whirlpool", header = T, stringsAsFactors = F)

art = read.table(text = "nbr     title     content
1       'title 1'   'This is an article about apple'
2       'title 2'   'This is an article about whirlpool'
3       'title 3'   'This is an article about samsung'
4       'title 4'   'This is an article about apple and samsung'
5       'title 5'   'This is an article about none of them'", header = T, stringsAsFactors = F)

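If your vendor names might occur inside other words, you may want to add word boundaries "\\b" before and after each name before using them as a regex pattern.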
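
As a minimal sketch of the word-boundary idea, the snippet below builds both a plain and a bounded pattern and compares what they match. It uses base R (`gregexpr` and `regmatches`) instead of stringr, and the article text containing "pineapple" is made-up data for illustration only.

```r
# Hypothetical data: "pineapple" contains "apple" as a substring
vendors <- c("apple", "samsung", "whirlpool")
content <- c("This is an article about pineapple",
             "This is an article about apple and samsung")

# Plain alternation: also matches "apple" inside "pineapple"
plain <- paste(vendors, collapse = "|")

# Bounded alternation: each vendor name must match as a whole word
bounded <- paste0("\\b", vendors, "\\b", collapse = "|")

# Extract all matches per article (one list element per article)
plain_hits <- regmatches(content, gregexpr(plain, content, perl = TRUE))
bounded_hits <- regmatches(content, gregexpr(bounded, content, perl = TRUE))

print(plain_hits)
print(bounded_hits)
```

With the bounded pattern the first article yields no match, while the second still yields both vendors. The same `bounded` string could be passed as the `pattern` argument of `stringr::str_extract_all`.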

Answer 1 (score: 2)

Assuming the vendor list is called vendor_df and the articles data frame is called df:

library(tidyverse)

df %>% 
  separate_rows(content, sep=" ") %>% 
  inner_join(vendor_df, by = c("content" = "vendor")) %>% 
  count(content)

df %>% 
  separate_rows(content, sep=" ") %>% 
  inner_join(vendor_df, by = c("content" = "vendor")) %>% 
  mutate(value = 1) %>% 
  spread(key = content, value = value, fill = 0)
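
One thing to keep in mind with this approach: `inner_join` drops articles that mention no vendor, and `spread` fails if the same vendor appears twice in one article. A possible variation, sketched here under the assumption that a recent tidyr with `pivot_wider` is installed, counts the matches per article first and then widens. The data frames are rebuilt inline from the question's sample data so the snippet stands alone.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt inline
vendor_df <- data.frame(vendor = c("apple", "samsung", "whirlpool"),
                        stringsAsFactors = FALSE)
df <- data.frame(
  nbr = 1:5,
  title = paste("title", 1:5),
  content = c("This is an article about apple",
              "This is an article about whirlpool",
              "This is an article about samsung",
              "This is an article about apple and samsung",
              "This is an article about none of them"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  separate_rows(content, sep = " ") %>%                    # one row per word
  inner_join(vendor_df, by = c("content" = "vendor")) %>%  # keep vendor words only
  count(title, content) %>%                                # mentions per article and vendor
  pivot_wider(names_from = content, values_from = n, values_fill = 0)

print(wide)
```

Articles without any vendor (title 5 here) are still absent from `wide`; they would have to be joined back in from `df` if they are needed.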

Answer 2 (score: 0)

Using the sample data from @Gregor's answer, for the first part you can do:

colSums(sapply(v$vendor, function(x) grepl(x, art$content)))

apple   samsung whirlpool 
    2         2         1 

For the second part:

mentions <- +(sapply(v$vendor, function(x) grepl(x, art$content)))
rownames(mentions) <- art$title

        apple samsung whirlpool
title 1     1       0         0
title 2     0       0         1
title 3     0       1         0
title 4     1       1         0
title 5     0       0         0
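
Note that `grepl` only records whether a vendor appears in an article, so a vendor named twice in the same article still counts as 1. If every occurrence should be counted, something along these lines might work. This is a base R sketch: the helper `count_matches` and the sample text are hypothetical, and `fixed = TRUE` is an assumption that vendor names should be matched literally, not as regular expressions.

```r
# Hypothetical data: the first article mentions "apple" twice
vendors <- c("apple", "samsung", "whirlpool")
titles <- c("title 1", "title 2", "title 3")
content <- c("apple beats samsung, says apple",
             "an article about whirlpool",
             "an article about none of them")

# Count every occurrence of `pattern` in each element of `text`
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, text, fixed = TRUE)
  # gregexpr() returns -1 for "no match", so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One column per vendor, one row per article
counts <- vapply(vendors, count_matches, integer(length(content)), text = content)
rownames(counts) <- titles

print(counts)
print(colSums(counts))  # total mentions per vendor across all articles
```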