I have two data frames. One is a list of vendors:
vendor
1 apple
2 samsung
3 whirlpool
etc
.
.
.
The other contains articles about particular vendors:
nbr title content
1 title 1 This is an article about apple
2 title 2 This is an article about whirlpool
3 title 3 This is an article about samsung
4 title 4 This is an article about apple and samsung
5 title 5 This is an article about none of them
etc
.
.
.
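I have tried many functions from the stringr package, but I don't want to count just one term; I want to count the whole vendor list. I also tried grouping and counting with dplyr, but I couldn't get that to work the way I wanted either.
In the end, I would like two outputs. First, the number of times each vendor is mentioned across all articles: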
apple 2
samsung 2
whirlpool 1
etc.
.
.
.
I would also like to see how many times each vendor is mentioned in each article:
title    apple  samsung  whirlpool  etc...
title 1  1
title 2                  1
title 3         1
title 4  1      1
title 5
etc.
.
.
.
Answer 0 (score: 4)
Here is one solution:
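# Extract every vendor name found in each article (one list element per article)
mentions = stringr::str_extract_all(art$content, pattern = paste(v$vendor, collapse = "|"))
# Number of articles mentioning each vendor (each vendor counted once per article)
table(unlist(lapply(mentions, unique)))
#     apple   samsung whirlpool
#         2         2         1
# Turn the matches into factors so vendors with no match still get a 0 count
mentions = lapply(mentions, factor, levels = v$vendor)
# Label each element with its article title so the rows below are named
names(mentions) = art$title
t(sapply(mentions, table))
#         apple samsung whirlpool
# title 1     1       0         0
# title 2     0       0         1
# title 3     0       1         0
# title 4     1       1         0
# title 5     0       0         0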
Using this data:
v = read.table(text = " vendor
1 apple
2 samsung
3 whirlpool", header = T, stringsAsFactors = F)
art = read.table(text = "nbr title content
1 'title 1' 'This is an article about apple'
2 'title 2' 'This is an article about whirlpool'
3 'title 3' 'This is an article about samsung'
4 'title 4' 'This is an article about apple and samsung'
5 'title 5' 'This is an article about none of them'", header = T, stringsAsFactors = F)
If your vendor names might appear inside other words, you may want to add word boundaries "\\b" before and after them before using them as regex patterns.
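To illustrate the word-boundary suggestion, here is a minimal sketch. It uses base R (`gregexpr()` with `regmatches()`) rather than stringr so that it runs without extra packages; the names `vendors`, `content`, `pattern` and `hits` are made up for this example, and "pineapple" is an invented test string that is not in the original data.

```r
# Hypothetical sample data: the first string contains "apple" only inside "pineapple".
vendors <- c("apple", "samsung", "whirlpool")
content <- c("This is an article about pineapple",
             "This is an article about apple and samsung")

# Wrap the alternation in a group so that \\b applies to every vendor name,
# not just the first and the last one.
pattern <- paste0("\\b(", paste(vendors, collapse = "|"), ")\\b")

# Base R counterpart of stringr::str_extract_all(); perl = TRUE selects PCRE,
# where \\b is a word boundary.
hits <- regmatches(content, gregexpr(pattern, content, perl = TRUE))

# One list element per article; an article with no whole-word match
# gets an empty character vector.
print(hits)
```

The same `pattern` string should also work as the `pattern` argument of `stringr::str_extract_all()` in the answer above.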
Answer 1 (score: 2)
Assuming you call the two data frames vendor_df and df:
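library(tidyverse)

# Total mentions per vendor: split content into one word per row,
# keep only the words that are vendor names, then count them
df %>%
  separate_rows(content, sep = " ") %>%
  inner_join(vendor_df, by = c("content" = "vendor")) %>%
  count(content)

# Mentions per article: same split and join, then one column per vendor
df %>%
  separate_rows(content, sep = " ") %>%
  inner_join(vendor_df, by = c("content" = "vendor")) %>%
  mutate(value = 1) %>%
  spread(key = content, value = value, fill = 0)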
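Two things to watch with the per-article version: articles that mention no vendor drop out at the `inner_join()`, and `spread()` stops with an error if the same vendor appears twice in one article. A possible variant, sketched under the assumption that you have a reasonably recent tidyr that provides `pivot_wider()`, counts per title first. The data is rebuilt from the question, and `per_article` is just a name chosen for this example.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the question.
vendor_df <- data.frame(vendor = c("apple", "samsung", "whirlpool"),
                        stringsAsFactors = FALSE)
df <- data.frame(
  nbr = 1:5,
  title = paste("title", 1:5),
  content = c("This is an article about apple",
              "This is an article about whirlpool",
              "This is an article about samsung",
              "This is an article about apple and samsung",
              "This is an article about none of them"),
  stringsAsFactors = FALSE
)

per_article <- df %>%
  separate_rows(content, sep = " ") %>%                   # one word per row
  inner_join(vendor_df, by = c("content" = "vendor")) %>% # keep vendor words only
  count(title, content) %>%                               # repeated mentions become n > 1
  pivot_wider(names_from = content, values_from = n, values_fill = 0)

print(per_article)
```

Articles without any vendor (here "title 5") are still missing from the result; joining `per_article` back onto the full list of titles would be one way to restore them.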
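Answer 2 (score: 0)
Using the sample data from @Gregor, for the first part you can do:
# Number of articles whose content contains each vendor name
colSums(sapply(v$vendor, function(x) grepl(x, art$content)))
#     apple   samsung whirlpool
#         2         2         1
For the second part:
# Unary + turns the logical matrix into 0/1 integers
mentions <- +(sapply(v$vendor, function(x) grepl(x, art$content)))
rownames(mentions) <- art$title
mentions
#         apple samsung whirlpool
# title 1     1       0         0
# title 2     0       0         1
# title 3     0       1         0
# title 4     1       1         0
# title 5     0       0         0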
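Note that `grepl()` only records whether a vendor appears in an article, not how often. If an article can mention the same vendor more than once and you want the actual number of occurrences, a rough base R sketch could look like this. The helper `count_mentions()` and the sample strings are hypothetical, invented for this example.

```r
# Hypothetical sample data in which "apple" occurs twice in the first article.
vendors <- c("apple", "samsung", "whirlpool")
content <- c("apple sued a reseller of apple products",
             "samsung and apple settled",
             "This is an article about none of them")

# Count the occurrences of one term in each text.
# gregexpr() returns -1 when there is no match, so only positive positions count.
count_mentions <- function(term, text) {
  m <- gregexpr(term, text, fixed = TRUE)
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

# Rows are articles, columns are vendors.
counts <- sapply(vendors, count_mentions, text = content)

print(counts)          # occurrences per article
print(colSums(counts)) # total occurrences per vendor
```

With `fixed = TRUE` the vendor names are matched literally, so "apple" inside "pineapple" would still be counted; use a regex with word boundaries instead if that matters for your data.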