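Combining observations in R using another data frame as the index?

Time: 2016-02-16 21:33:47

Tags: r dataframe

I have two data.frames, and I want to use one as a reference for combining observations in the other.

First, I have data: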

> data
Source: local data frame [15 x 7]

           upc fips_state_code mymonth     price units  year     sales
         (dbl)           (int)   (dbl)     (dbl) (int) (dbl)     (dbl)
1   1153801013               2       3  25.84620   235  2008 6073.8563
2   1153801013               1       2  28.61981   108  2009 3090.9396
3   1153801013               2       2  27.99000     7  2009  195.9300
4   1153801013               1       1  27.99000     4  2009  111.9600
5   1153801013               1       3  27.99000     7  2008  195.9300
6  72105922753               1       3  27.10816   163  2008 4418.6306
7  72105922765               2       2  24.79000     3  2010   74.3700
8  72105922765               2       2  25.99000     1  2009   25.9900
9  72105922765               1       2  23.58091    13  2009  306.5518
10  1071917100               2       2 300.07000     1  2009  300.0700
11  1071917100               1       3 307.07000     2  2008  614.1400
12  1071917100               2       3 269.99000     1  2010  269.9900
13  1461503541               2       2   0.65200     8  2008    5.2160
14  1461503541               2       2  13.99000    11  2010  153.8900
15  1461503541               1       1   0.87000     1  2008    0.8700

Then I have z, which is the reference:

> z
             upc  code
3     1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2     1071917100 50174
1256  8723610700 50174

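I want to combine the data points in data whose upc values share the same code in z.

In the sample I gave you, there are 5 different upcs.

1071917100 is also in z, with code 50174. However, the only other upc with this code, 8723610700, is not in data. Therefore, it stays unchanged.

1461503541 is not in z at all, so it also stays unchanged.

1153801013, 72105922753 and 72105922765 all share the same code in z, 52161. Therefore, I want to combine all observations with these upcs.

I want to do this in a very specific way: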

  1. First, I want to select the upc with the highest sales in the data. 1153801013 has 9668.616 in sales (just the sum of all sales for that upc). 72105922753 has 4418.631 in sales and 72105922765 has 406.9118. Therefore, I choose 1153801013 as the upc for all of them.

  2. Now that this upc is selected, I want to change the upc values 72105922753 and 72105922765 to 1153801013 in the data.

  3. Now we have a data set that looks like this:

    > data1
    Source: local data frame [15 x 7]

              upc fips_state_code mymonth     price units  year     sales
            (dbl)           (int)   (dbl)     (dbl) (int) (dbl)     (dbl)
    1  1153801013               2       3  25.84620   235  2008 6073.8563
    2  1153801013               1       2  28.61981   108  2009 3090.9396
    3  1153801013               2       2  27.99000     7  2009  195.9300
    4  1153801013               1       1  27.99000     4  2009  111.9600
    5  1153801013               1       3  27.99000     7  2008  195.9300
    6  1153801013               1       3  27.10816   163  2008 4418.6306
    7  1153801013               2       2  24.79000     3  2010   74.3700
    8  1153801013               2       2  25.99000     1  2009   25.9900
    9  1153801013               1       2  23.58091    13  2009  306.5518
    10 1071917100               2       2 300.07000     1  2009  300.0700
    11 1071917100               1       3 307.07000     2  2008  614.1400
    12 1071917100               2       3 269.99000     1  2010  269.9900
    13 1461503541               2       2   0.65200     8  2008    5.2160
    14 1461503541               2       2  13.99000    11  2010  153.8900
    15 1461503541               1       1   0.87000     1  2008    0.8700

  4. Finally, I want to combine all data points that share the same upc, fips_state_code, year and mymonth. This is done by summing sales and units over the data points with the same upc, fips_state_code, year and mymonth, and then recomputing the weighted price (i.e. price = total Sales / total Units).

  5. Thus, the final data set, data2, should be the result of the aggregation in step 4.

I did try to do this, but it took me many lines of code, and I could not get the last part to work. If anything is unclear, please let me know, and thank you very much in advance.

Here is the dput code:
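A minimal base R sketch of steps 1 to 4, assuming the two printed tables are retyped by hand as plain data frames (values as printed, so this is a stand-in and not the original dput output). The helper names `in_z`, `best` and `agg` are introduced here for illustration only.

```r
# Sample data retyped from the printed tables (assumption: values as printed).
data <- data.frame(
  upc = c(rep(1153801013, 5), 72105922753, rep(72105922765, 3),
          rep(1071917100, 3), rep(1461503541, 3)),
  fips_state_code = c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L),
  mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2, 2, 3, 3, 2, 2, 1),
  price = c(25.84620, 28.61981, 27.99, 27.99, 27.99, 27.10816, 24.79, 25.99,
            23.58091, 300.07, 307.07, 269.99, 0.652, 13.99, 0.87),
  units = c(235L, 108L, 7L, 4L, 7L, 163L, 3L, 1L, 13L, 1L, 2L, 1L, 8L, 11L, 1L),
  year = c(2008, 2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008,
           2010, 2008, 2010, 2008),
  sales = c(6073.8563, 3090.9396, 195.93, 111.96, 195.93, 4418.6306, 74.37,
            25.99, 306.5518, 300.07, 614.14, 269.99, 5.216, 153.89, 0.87)
)
z <- data.frame(
  upc = c(1153801013, 72105922753, 72105922765, 81153801013, 81153801041,
          1071917100, 8723610700),
  code = c(52161, 52161, 52161, 52161, 52161, 50174, 50174)
)

# Look up each row's code; NA means the upc does not appear in z.
data$code <- z$code[match(data$upc, z$upc)]
in_z <- !is.na(data$code)

# Steps 1-2: within each code, find the upc with the highest total sales
# and overwrite every upc in that code with it. Rows not in z are untouched.
best <- sapply(split(data[in_z, ], data$code[in_z]), function(d) {
  d$upc[which.max(ave(d$sales, d$upc, FUN = sum))]
})
data$upc[in_z] <- unname(best[as.character(data$code[in_z])])

# Steps 3-4: sum sales and units per (upc, state, month, year),
# then recompute the weighted price.
agg <- aggregate(cbind(sales, units) ~ upc + fips_state_code + mymonth + year,
                 data = data, FUN = sum)
agg$price <- agg$sales / agg$units
```

Restricting the reassignment to `in_z` rows matters when several upcs are missing from z: they all carry an NA code and would otherwise be at risk of being lumped together.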

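2 answers:

Answer 0: (score: 1)

I think this works. The rows of the final result are in a different order from your data2, but at a glance it looks the same.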

# join data
joined = data %>% left_join(z)

# set aside the rows not in z
not_in_z = filter(joined, is.na(code))

modified = joined %>%
    filter(!is.na(code)) %>%     # for the rows in z
    group_by(code) %>%           # group by code
    arrange(desc(sales)) %>%     # sort by sales (row with highest sales first)
    mutate(upc = first(upc)) %>% # change all UPC codes to the UPC of that top
                                 # row (within group); in this sample it is also
                                 # the UPC with the highest total sales
    bind_rows(not_in_z)          # tack back on the rows that weren't in z

The modified data should match your data1 (it also has a code column, but you can drop that).

final = modified %>%
    ungroup() %>%                # redo the grouping
    group_by(upc, fips_state_code, mymonth, year) %>%
    summarize(                   # add your summary columns
        sales = sum(sales),
        units = sum(units),
        price = sales / units
    ) %>%
    select(    # get columns in the same order as your "data2"
        upc, fips_state_code, mymonth, price, units, year, sales
    )
final
# Source: local data frame [12 x 7]
# Groups: upc, fips_state_code, mymonth [10]
# 
#           upc fips_state_code mymonth     price units  year    sales
#         (dbl)           (int)   (dbl)     (dbl) (int) (dbl)    (dbl)
# 1  1071917100               1       3 307.07000     2  2008  614.140
# 2  1071917100               2       2 300.07000     1  2009  300.070
# 3  1071917100               2       3 269.99000     1  2010  269.990
# 4  1153801013               1       1  27.99000     4  2009  111.960
# 5  1153801013               1       2  28.07844   121  2009 3397.491
# 6  1153801013               1       3  27.14447   170  2008 4614.561
# 7  1153801013               2       2  27.74000     8  2009  221.920
# 8  1153801013               2       2  24.79000     3  2010   74.370
# 9  1153801013               2       3  25.84620   235  2008 6073.856
# 10 1461503541               1       1   0.87000     1  2008    0.870
# 11 1461503541               2       2   0.65200     8  2008    5.216
# 12 1461503541               2       2  13.99000    11  2010  153.890
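As a quick sanity check on the weighted price, row 5 of this output is the merge of original rows 2 and 9 (state 1, month 2, year 2009). The sketch below redoes that arithmetic by hand with the values printed in the question, and contrasts it with a plain average of the two prices, which is not what the question asks for.

```r
# Sales, units and prices of original rows 2 and 9, as printed in the question.
sales  <- c(3090.9396, 306.5518)
units  <- c(108, 13)
prices <- c(28.61981, 23.58091)

# Weighted price: total sales divided by total units.
weighted_price <- sum(sales) / sum(units)
round(weighted_price, 5)   # 28.07844, matching row 5 above

# An unweighted mean of the two prices gives a different number.
simple_mean <- mean(prices)
round(simple_mean, 5)      # 26.10036
```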

Answer 1: (score: 1)

Here is a data.table approach.

First, initialize the data.tables:

library(data.table)
setDT(data); setDT(z)

Re-assign upc:

#merge to add `code` to `data`
data[z, code := i.code, on = "upc"]

#add a new column with sales by `upc`
data[ , upc_sales := sum(sales), by = upc]

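#re-assign (only rows that have a code, so that upcs missing from z,
#  which all share code NA, are never merged with each other)
data[!is.na(code), upc := upc[which.max(upc_sales)], by = code]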

Aggregate:

data2 <- data[ , .(sales = sum(sales), units = sum(units)),
               by = .(upc, fips_state_code, mymonth, year)
               ][ , price := sales / units]

There are minor differences compared with your data2, but setcolorder and := NULL can easily fix those.
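A short sketch of those two clean-up steps, assuming the data.table package is installed. The one-row tables below are hypothetical stand-ins built from values shown earlier, not the full result.

```r
library(data.table)

# Hypothetical one-row stand-in for the aggregated result.
data2 <- data.table(upc = 1153801013, fips_state_code = 1L, mymonth = 2,
                    year = 2009, sales = 3397.4914, units = 121L)
data2[ , price := sales / units]

# setcolorder() reorders columns by reference to match the question's layout.
setcolorder(data2, c("upc", "fips_state_code", "mymonth",
                     "price", "units", "year", "sales"))

# Hypothetical stand-in for `data` after the re-assignment step,
# still carrying the two helper columns.
data <- data.table(upc = 1153801013, sales = 6073.8563,
                   code = 52161, upc_sales = 9668.6159)

# := NULL drops columns by reference.
data[ , c("code", "upc_sales") := NULL]
```

Both calls modify the table in place, so no reassignment with `<-` is needed.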

This can also be done in two commands, but it is a bit less clear:

data[z, code := i.code, on = "upc"]

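# unique(upc) lines up with the per-upc sums; plain upc[...] would index rows.
# !is.na(code) leaves upcs that are not in z untouched.
data[!is.na(code), upc := 
        unique(upc)[which.max(.SD[ , sum(sales), by = upc]$V1)], 
      by = code][ , {sl <- sum(sales); us <- sum(units)
        .(sales = sl, units = us, price = sl/us)},
               by = .(upc, fips_state_code, mymonth, year)]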