Question

After manipulating the raw data, we obtained the following data.frame:

        ItemID    GroupID mentions
1         601          3     1
2         601          4     1
3         611          3     1
4         661          3     1
5         801          3     1
6         821          3     1
6         841          1     3
6         841          2     3
6         841          3     3
6         841          4     3

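I have 10,000 records like these, and my first goal is to count the items that are represented in all 4 GroupIDs. First, I tried to do this visually with a plot.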

ggplot(item.stats, aes(x=ItemID, y=mentions, fill=GroupID)) + 
  geom_bar(stat='identity', position='dodge')

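With a large data set, this plot is not readable. What is the best way to find out how many items are represented in all groups, together with their mentions?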

After filtering, the example above should contain only:

        ItemID    GroupID mentions
6         841          1     3
6         841          2     3
6         841          3     3
6         841          4     3

An attempt at a more meaningful visualization:

test.with.id <- transform(test,id=as.numeric(factor(ItemID)))
ggplot(test.with.id, aes(x=id, y=mentions, fill=GroupID)) + 
  geom_histogram(stat='identity', position='stack', binwidth = 2)

This may be similar to: How to plot multiple stacked histograms together in R?

Answer 1

You can group by ItemID and then keep only the groups in which all 4 group IDs appear in the GroupID column:

library(dplyr)

df %>% group_by(ItemID) %>% filter(all(1:4 %in% GroupID))
# A tibble: 4 x 3
# Groups:   ItemID [1]
#  ItemID GroupID mentions
#   <int>   <int>    <int>
#1    841       1        3
#2    841       2        3
#3    841       3        3
#4    841       4        3
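The question also asks how many items cover every group, not only which rows survive the filter. A minimal base R sketch of that count follows. It rebuilds the sample data from the question, and it assumes that "all groups" means every distinct GroupID present in the data rather than a hard-coded 1:4. The names groups.per.item, n.groups, complete.items and complete.rows are made up for illustration.

```r
# Sample data copied from the question (row names omitted).
item.stats <- data.frame(
  ItemID   = c(601, 601, 611, 661, 801, 821, 841, 841, 841, 841),
  GroupID  = c(3, 4, 3, 3, 3, 3, 1, 2, 3, 4),
  mentions = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# Number of distinct groups in which each item appears.
groups.per.item <- tapply(item.stats$GroupID, item.stats$ItemID,
                          function(g) length(unique(g)))

# Assumption: "all groups" = every distinct GroupID found in the data.
n.groups <- length(unique(item.stats$GroupID))

# IDs of the items that appear in every group.
complete.items <- as.numeric(names(groups.per.item)[groups.per.item == n.groups])

# How many items cover all groups.
length(complete.items)

# The corresponding rows, equivalent to the filtered table in the question.
complete.rows <- item.stats[item.stats$ItemID %in% complete.items, ]
complete.rows
```

If the set of groups is known in advance, n.groups can be replaced by that fixed number, which also guards against a group that happens to be missing from the data entirely.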
