Question

I have two data frames. One is a list of vendors:

      vendor
1     apple
2     samsung
3     whirlpool
etc
.
.
.

The other contains articles about specific vendors:

nbr     title     content
1       title 1   This is an article about apple
2       title 2   This is an article about whirlpool
3       title 3   This is an article about samsung
4       title 4   This is an article about apple and samsung
5       title 5   This is an article about none of them
etc
.
.
.

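I have tried many of the functions in the stringr package, but I don't want to count just one term; I want to count the whole list of vendors. I also tried grouping and counting with dplyr, but I couldn't get that to work the way I want either.

In the end, I would like two outputs. The first is the number of times each vendor is mentioned across all articles: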

apple       2
samsung     2
whirlpool   1
etc.
.
.
.

The second is the number of times each vendor is mentioned in each article:

title     apple     samsung     whirlpool    etc...
title 1       1
title 2                                 1
title 3                   1
title 4       1           1
title 5
etc.
.
.
.

Answer 1

Here is one solution:

mentions = stringr::str_extract_all(art$content, pattern = paste(v$vendor, collapse = "|"))
table(unlist(lapply(mentions, unique)))
# apple   samsung whirlpool 
#     2         2         1 

mentions = lapply(mentions, factor, levels = v$vendor)
names(mentions) = art$title  # name each element so the rows below are labelled by title
t(sapply(mentions, table))
#         apple samsung whirlpool
# title 1     1       0         0
# title 2     0       0         1
# title 3     0       1         0
# title 4     1       1         0
# title 5     0       0         0

Using this data:

v = read.table(text = "      vendor
1     apple
2     samsung
3     whirlpool", header = T, stringsAsFactors = F)

art = read.table(text = "nbr     title     content
1       'title 1'   'This is an article about apple'
2       'title 2'   'This is an article about whirlpool'
3       'title 3'   'This is an article about samsung'
4       'title 4'   'This is an article about apple and samsung'
5       'title 5'   'This is an article about none of them'", header = T, stringsAsFactors = F)

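If there is a chance that your vendor names appear inside other words, you may want to add word boundaries "\\b" before and after them before using them as a regex pattern.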
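
As a rough illustration of the word-boundary suggestion, here is a minimal sketch. It uses base R (regmatches and gregexpr) rather than stringr so that it runs without extra packages, and the two sample sentences are made up for the example. It also assumes the vendor names contain no regex metacharacters.

```r
# Hypothetical sample data; only the vendor names come from the question
vendors <- c("apple", "samsung", "whirlpool")
content <- c("a pineapple smoothie", "apple and samsung")

# Wrap the alternation in \\b so that "apple" does not match inside "pineapple"
pattern <- paste0("\\b(", paste(vendors, collapse = "|"), ")\\b")

# One character vector of matches per article (empty when nothing matches)
found <- regmatches(content, gregexpr(pattern, content, perl = TRUE))
found
```

Without the boundaries, the first sentence would count as a mention of apple.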

Answer 2

Assuming you call the two data frames vendor_df and df:

library(tidyverse)

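# Total mentions per vendor: split each article into one row per word,
# keep only the words that are vendor names, then count them
df %>% 
  separate_rows(content, sep=" ") %>% 
  inner_join(vendor_df, by = c("content" = "vendor")) %>% 
  count(content)

# Mentions per article: same split and join, then one column per vendor.
# Articles that mention no vendor are dropped by the inner join.
df %>% 
  separate_rows(content, sep=" ") %>% 
  inner_join(vendor_df, by = c("content" = "vendor")) %>% 
  mutate(value = 1) %>% 
  spread(key = content, value = value, fill = 0)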
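
One caveat with the second pipeline: if an article mentions the same vendor more than once, spread() sees duplicate rows for the same title and vendor and stops with an error. A possible variation, sketched here with made-up sample data and assuming a reasonably recent tidyr (pivot_wider() superseded spread() in tidyr 1.0.0), counts the mentions first and then widens:

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data; "title 1" mentions apple twice
vendor_df <- data.frame(vendor = c("apple", "samsung", "whirlpool"),
                        stringsAsFactors = FALSE)
df <- data.frame(
  nbr = 1:4,
  title = paste("title", 1:4),
  content = c("This is an article about apple and more apple",
              "This is an article about whirlpool",
              "This is an article about apple and samsung",
              "This is an article about none of them"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  separate_rows(content, sep = " ") %>%                    # one row per word
  inner_join(vendor_df, by = c("content" = "vendor")) %>%  # keep vendor words only
  count(title, content) %>%                                # mentions per title and vendor
  pivot_wider(names_from = content, values_from = n, values_fill = 0)
wide
```

As in the original pipeline, articles that mention no vendor do not appear in the result, and a vendor that is never mentioned gets no column.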

Answer 3

Using the sample data from @Gregor's answer, for the first part you can do the following:

colSums(sapply(v$vendor, function(x) grepl(x, art$content)))

apple   samsung whirlpool 
    2         2         1

For the second part:

# The unary + turns the logical matrix returned by sapply() into 0/1 integers
mentions <- +(sapply(v$vendor, function(x) grepl(x, art$content)))
rownames(mentions) <- art$title
mentions

        apple samsung whirlpool
title 1     1       0         0
title 2     0       0         1
title 3     0       1         0
title 4     1       1         0
title 5     0       0         0
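
Note that grepl() only reports whether a vendor appears in an article, so an article that mentions a vendor several times still counts as 1. If repeated mentions should be counted, something along these lines might work. This is a sketch with made-up sample sentences, and count_hits is a hypothetical helper, not part of any package. It uses fixed = TRUE, so vendor names are matched literally, which also means they can match inside longer words.

```r
# Hypothetical sample data; only the vendor names come from the question
vendors <- c("apple", "samsung", "whirlpool")
content <- c("apple beats apple", "samsung and apple", "none of them")

# Hypothetical helper: number of literal occurrences of `pattern` in each text
count_hits <- function(pattern, text) {
  hits <- gregexpr(pattern, text, fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One row per article, one column per vendor
counts <- sapply(vendors, count_hits, text = content)
counts
colSums(counts)  # total mentions per vendor across all articles
```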

How to count the number of times elements of one data frame appear in another data frame
