Question

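I have two data frames of volume by date. They contain the same data, but one of them is filtered. I want to plot a trend line of the ratio between the filtered and unfiltered data for any given day, but I'm having trouble shaping the data frames so that they are comparable. Here is an example: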

unFiltered <- data.frame(date = c("01-01-2015", "01-01-2015", "01-02-2015"), item = c("item1", "item2", "item1"), volume = c(100, 100, 50))

filtered <- data.frame(date = c("01-01-2015", "01-03-2015"), item = c("item1", "item1"), volume = c(10, 40))

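From these data sets, I want to build a third data set: "the percentage of the unfiltered item volume that is being filtered". That is, I want a data frame that looks like this: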

    date          item    percentage
1 "01-01-2015"    item1   .1
2 "01-01-2015"    item2    0
3 "01-02-2015"    item1    0
4 "01-02-2015"    item2    0
5 "01-03-2015"    item1   .8
6 "01-03-2015"    item2    0

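(Note: neither data frame has 6 entries, but the resulting data frame has one row for each combination of the unique values of item and the unique values of date.)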

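Does anyone have any ideas? I've been stuck on this for about 2 hours, fumbling with for loops, merges, joins, building data frames by hand, and so on. If someone has a solution, would you mind explaining what it is doing? (I'm still fairly new to R, and I often read code that someone has written without understanding why it works.)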

Answer 1

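By default, merge keeps only the rows that are present in both data frames, so we set all.x = TRUE to make sure it keeps every row of the x data frame. By default it also tries to match on all columns that share a name; since we don't want to match on the volume column, we use the by argument to specify the columns to match on: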

both = merge(x = unFiltered, y = filtered,
             all.x = TRUE, by = c("date", "item"))

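This gives us one version of the volume column from each source: volume.x from unFiltered and volume.y from filtered. (We could also rename the volume columns in the original data frames to get the same effect, as in Laterow's comment.)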

both  # just checking out what's there
#         date  item volume.x volume.y
# 1 01-01-2015 item1      100       10
# 2 01-01-2015 item2      100       NA
# 3 01-02-2015 item1       50       NA

# fill in missing values with 0
both$volume.y[is.na(both$volume.y)] = 0

# calculate the percentage
both$percentage = both$volume.y / both$volume.x

both  # demonstrate the result
#         date  item volume.x volume.y percentage
# 1 01-01-2015 item1      100       10        0.1
# 2 01-01-2015 item2      100        0        0.0
# 3 01-02-2015 item1       50        0        0.0

# drop unwanted columns
both = both[c("date", "item", "percentage")]

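I've commented the code and shown the intermediate results above, but I want to make it clear how simple this is. The only commands that need to be run are: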

both = merge(x = unFiltered, y = filtered,
             all.x = TRUE, by = c("date", "item"))
both$volume.y[is.na(both$volume.y)] = 0
both$percentage = both$volume.y / both$volume.x
both = both[c("date", "item", "percentage")]
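To see what all.x = TRUE and by are doing, here is a small check against the example data from the question. It is only a sketch: suffixes is a standard merge argument, but the values "_all" and "_filtered" are illustrative names chosen here, not something from the question.

```r
unFiltered <- data.frame(date = c("01-01-2015", "01-01-2015", "01-02-2015"),
                         item = c("item1", "item2", "item1"),
                         volume = c(100, 100, 50))
filtered <- data.frame(date = c("01-01-2015", "01-03-2015"),
                       item = c("item1", "item1"),
                       volume = c(10, 40))

# Default merge: only (date, item) pairs present in BOTH data frames survive.
inner = merge(unFiltered, filtered, by = c("date", "item"))
nrow(inner)
# 1

# all.x = TRUE keeps every row of unFiltered; unmatched rows get NA.
# suffixes replaces the default ".x"/".y" endings (names are illustrative).
both = merge(unFiltered, filtered, by = c("date", "item"),
             all.x = TRUE, suffixes = c("_all", "_filtered"))
nrow(both)
# 3
names(both)
# "date" "item" "volume_all" "volume_filtered"
```

Descriptive suffixes make the later division easier to read than volume.y / volume.x, at the cost of slightly longer column names.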

Some people (like me!) find dplyr more readable. Here is a dplyr version of the same thing:

library(dplyr)
unFiltered %>%
    rename(all_volume = volume) %>%
    left_join(filtered, by = c("date", "item")) %>%
    mutate(volume = ifelse(is.na(volume), 0, volume),
           percentage = volume / all_volume) %>%
    select(-all_volume, -volume)

#         date  item percentage
# 1 01-01-2015 item1        0.1
# 2 01-01-2015 item2        0.0
# 3 01-02-2015 item1        0.0

Answer 2

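So, either your example code is lacking, or your problem description is. In particular, if one data set is filtered from the other, you would never expect the filtered set to contain entries that are missing from the unfiltered one.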
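That mismatch can be checked directly on the example data. The sketch below is one way to do it in base R; key is a hypothetical helper defined here for illustration, not a built-in function.

```r
unFiltered <- data.frame(date = c("01-01-2015", "01-01-2015", "01-02-2015"),
                         item = c("item1", "item2", "item1"),
                         volume = c(100, 100, 50))
filtered <- data.frame(date = c("01-01-2015", "01-03-2015"),
                       item = c("item1", "item1"),
                       volume = c(10, 40))

# Hypothetical helper: one string per row that identifies its (date, item) pair.
key = function(df) paste(as.character(df$date), as.character(df$item))

# Rows of filtered whose (date, item) pair never occurs in unFiltered.
orphans = filtered[!(key(filtered) %in% key(unFiltered)), ]
nrow(orphans)
# 1
```

The single orphan row is the 01-03-2015 entry for item1, which has no unfiltered volume to divide by.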

In any case, here is a solution to at least one of your problems:

itemsAndDate = unique(rbind(unFiltered[,c("date", "item")],
                            filtered[,c("date", "item")]))

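## Here is how you would expand the concept to unobserved combinations.
## unique() avoids the duplicate rows expand.grid would otherwise produce.
combos = expand.grid(date = unique(itemsAndDate$date),
                     item = unique(itemsAndDate$item))
head(combos)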

combined = merge(merge(itemsAndDate, unFiltered, by = c("date", "item"), all.x = TRUE), filtered, by = c("date", "item"), all.x = TRUE)
head(combined)
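The merge above starts from itemsAndDate, so it only contains the combinations that were actually observed. To get one row for every date/item combination, as the question asks, the grid of all combinations can be used as the left side of the merge instead. A minimal sketch, assuming that a missing filtered volume means 0 and that a missing unfiltered volume leaves the ratio undefined (NA); the names grid and full are illustrative:

```r
unFiltered <- data.frame(date = c("01-01-2015", "01-01-2015", "01-02-2015"),
                         item = c("item1", "item2", "item1"),
                         volume = c(100, 100, 50))
filtered <- data.frame(date = c("01-01-2015", "01-03-2015"),
                       item = c("item1", "item1"),
                       volume = c(10, 40))

# Every observed date crossed with every observed item.
dates = unique(c(as.character(unFiltered$date), as.character(filtered$date)))
items = unique(c(as.character(unFiltered$item), as.character(filtered$item)))
grid = expand.grid(date = dates, item = items, stringsAsFactors = FALSE)

# Attach both volume columns to the grid, keeping every grid row.
full = merge(grid, unFiltered, by = c("date", "item"), all.x = TRUE)
full = merge(full, filtered, by = c("date", "item"), all.x = TRUE,
             suffixes = c(".all", ".filtered"))

# Assumption: no filtered row means nothing was filtered for that pair.
full$volume.filtered[is.na(full$volume.filtered)] = 0

# If the unfiltered volume is missing, the ratio stays NA.
full$percentage = full$volume.filtered / full$volume.all
full = full[c("date", "item", "percentage")]
nrow(full)
# 6
```

With the example data, three of the six percentages are NA because there is no unfiltered volume for those pairs. Replacing them with 0 would match most of the table in the question, but the .8 shown there for 01-03-2015 cannot be derived from the sample data as posted.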

R - Comparing data frames of different sizes
