Using R

Time: 2016-02-18 21:59:07

Tags: r data-manipulation

I have two data frames, and I want to use one of them as a reference for combining rows in the other.

First, I have data:

> data
           upc fips_state_code mymonth     price units year     sales
1   1153801013               2       3  25.84620   235 2008 6073.8563
2   1153801013               1       2  28.61981   108 2009 3090.9396
3   1153801013               2       2  27.99000     7 2009  195.9300
4   1153801013               1       1  27.99000     4 2009  111.9600
5   1153801013               1       3  27.99000     7 2008  195.9300
6  72105922753               1       3  27.10816   163 2008 4418.6306
7  72105922765               2       2  24.79000     3 2010   74.3700
8  72105922765               2       2  25.99000     1 2009   25.9900
9  72105922765               1       2  23.58091    13 2009  306.5518
10  1071917100               2       2 300.07000     1 2009  300.0700
11  1071917100               1       3 307.07000     2 2008  614.1400
12  1071917100               2       3 269.99000     1 2010  269.9900
13  1461503541               2       2   0.65200     8 2008    5.2160
14  1461503541               2       2  13.99000    11 2010  153.8900
15  1461503541               1       1   0.87000     1 2008    0.8700
16    11111111               1       1   3.00000     2 2008    6.0000
17    11111112               1       1   6.00000     5 2008   30.0000

Then I have z, which is the reference:

> z
             upc  code
3     1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2     1071917100 50174
1256  8723610700 50174

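I want to combine the data points in data whose UPCs share the same code in z.

In the sample I have given, there are 7 distinct UPCs.

1071917100 is in z, with code 50174. However, the only other UPC under this code is 8723610700, which is not in data, so 1071917100 stays unchanged.

1461503541, 11111111, and 11111112 are not in z at all, so they also stay unchanged.

1153801013, 72105922753, and 72105922765 share the same code in z, 52161. So I want to combine all observations with these UPCs.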
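
The grouping described above can be checked with a small lookup. This is only a sketch in base R: the names `data_upc`, `code_of`, `groups`, and `to_merge` are made up for illustration, and the UPC vector is typed in by hand from the sample rather than taken from the real data.

```r
# The 7 distinct UPCs that appear in the sample sales data.
data_upc <- c(1153801013, 72105922753, 72105922765, 1071917100,
              1461503541, 11111111, 11111112)

# Reference table z: each UPC belongs to one code.
z <- data.frame(
  upc  = c(1153801013, 72105922753, 72105922765, 81153801013,
           81153801041, 1071917100, 8723610700),
  code = c(52161L, 52161L, 52161L, 52161L, 52161L, 50174L, 50174L)
)

# Look up the code of each data UPC; UPCs missing from z get NA.
code_of <- z$code[match(data_upc, z$upc)]

# For every code, list the UPCs that are actually present in the data.
has_code <- !is.na(code_of)
groups <- split(data_upc[has_code], code_of[has_code])
print(groups)

# Only codes with two or more UPCs in the data need merging.
to_merge <- groups[lengths(groups) > 1]
print(names(to_merge))
```

With the sample values, only code 52161 has more than one UPC present in the data, which matches the reasoning above.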

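I want to do this in a very specific way:

  1. First, I want to pick the UPC with the largest sales in data. 1153801013 has sales of 9668.616 (simply the sum of all sales with that UPC). 72105922753 has sales of 4418.631. 72105922765 has sales of 406.9118. So I pick 1153801013 as the UPC for all of them.

  2. With this UPC chosen, I want to change 72105922753 and 72105922765 to 1153801013 in data.

  3. Now we have a dataset that looks like this: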

    > data1
              upc fips_state_code mymonth     price units year     sales
    1  1153801013               2       3  25.84620   235 2008 6073.8563
    2  1153801013               1       2  28.61981   108 2009 3090.9396
    3  1153801013               2       2  27.99000     7 2009  195.9300
    4  1153801013               1       1  27.99000     4 2009  111.9600
    5  1153801013               1       3  27.99000     7 2008  195.9300
    6  1153801013               1       3  27.10816   163 2008 4418.6306
    7  1153801013               2       2  24.79000     3 2010   74.3700
    8  1153801013               2       2  25.99000     1 2009   25.9900
    9  1153801013               1       2  23.58091    13 2009  306.5518
    10 1071917100               2       2 300.07000     1 2009  300.0700
    11 1071917100               1       3 307.07000     2 2008  614.1400
    12 1071917100               2       3 269.99000     1 2010  269.9900
    13 1461503541               2       2   0.65200     8 2008    5.2160
    14 1461503541               2       2  13.99000    11 2010  153.8900
    15 1461503541               1       1   0.87000     1 2008    0.8700
    16   11111111               1       1   3.00000     2 2008    6.0000
    17   11111112               1       1   6.00000     5 2008   30.0000
    
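  4. Finally, I want to combine all data points that share the same upc, fips_state_code, mymonth, and year. To do this, sum the sales and units of those data points and then recompute the weighted price (i.e., price = total sales / total units).
  5. So the final dataset should look like this: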

      > data2
                upc fips_state_code mymonth     price units year    sales
      1  1153801013               2       3  25.84620   235 2008 6073.856
      2  1153801013               1       2  28.07844   121 2009 3397.491
      3  1153801013               2       2  27.74000     8 2009  221.920
      4  1153801013               1       1  27.99000     4 2009  111.960
      5  1153801013               1       3  27.14448   170 2008 4614.561
      6  1153801013               2       2  24.79000     3 2010   74.370
      7  1071917100               2       2 300.07000     1 2009  300.070
      8  1071917100               1       3 307.07000     2 2008  614.140
      9  1071917100               2       3 269.99000     1 2010  269.990
      10 1461503541               2       2   0.65200     8 2008    5.216
      11 1461503541               2       2  13.99000    11 2010  153.890
      12 1461503541               1       1   0.87000     1 2008    0.870
      13   11111111               1       1   3.00000     2 2008    6.000
      14   11111112               1       1   6.00000     5 2008   30.000
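
The five steps above could be sketched in base R roughly as follows. This is one possible approach under stated assumptions, not a definitive solution: `upc_sales`, `top_upc`, `row_code`, and `new_upc` are illustrative names, the data is retyped from the dput output, and the price column is left out of the input because it is recomputed at the end.

```r
# Sample data from the question (price omitted; it is recomputed below).
data <- data.frame(
  upc = c(rep(1153801013, 5), 72105922753, rep(72105922765, 3),
          rep(1071917100, 3), rep(1461503541, 3), 11111111, 11111112),
  fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1),
  mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1),
  units = c(235, 108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5),
  year = c(2008, 2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009,
           2008, 2010, 2008, 2010, 2008, 2008, 2008),
  sales = c(6073.8563380285, 3090.9396226464, 195.93, 111.96, 195.93,
            4418.6306122439, 74.37, 25.99, 306.5518181817, 300.07, 614.14,
            269.99, 5.216, 153.89, 0.87, 6, 30)
)
z <- data.frame(
  upc = c(1153801013, 72105922753, 72105922765, 81153801013, 81153801041,
          1071917100, 8723610700),
  code = c(52161L, 52161L, 52161L, 52161L, 52161L, 50174L, 50174L)
)

# Step 1: total sales per UPC, restricted to UPCs that have a code in z.
upc_sales <- aggregate(sales ~ upc, data = data, FUN = sum)
upc_sales$code <- z$code[match(upc_sales$upc, z$upc)]
upc_sales <- upc_sales[!is.na(upc_sales$code), ]

# Within each code, keep the UPC with the largest total sales.
upc_sales <- upc_sales[order(upc_sales$code, -upc_sales$sales), ]
top_upc <- upc_sales[!duplicated(upc_sales$code), c("code", "upc")]

# Steps 2-3: recode every UPC that has a code to that code's top seller.
row_code <- z$code[match(data$upc, z$upc)]
new_upc <- top_upc$upc[match(row_code, top_upc$code)]
data1 <- data
data1$upc <- ifelse(is.na(new_upc), data$upc, new_upc)

# Steps 4-5: sum units and sales per key, then recompute the weighted price.
data2 <- aggregate(cbind(units, sales) ~ upc + fips_state_code + mymonth + year,
                   data = data1, FUN = sum)
data2$price <- data2$sales / data2$units
print(data2)
```

The rows of the result come out in a different order than in the table above, because `aggregate` sorts by the grouping columns, but the 14 groups and their totals should match.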
      

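I have tried doing this myself, but it seems it could be done more efficiently than my code, perhaps with dplyr, and I could not get the last part to work. Please let me know if anything is unclear, and thank you very much in advance.

Here is the dput code: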

      data<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013, 
      1153801013, 72105922753, 72105922765, 72105922765, 72105922765, 
      1071917100, 1071917100, 1071917100, 1461503541, 1461503541, 1461503541, 
      11111111, 11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 
      2, 1, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3, 
      2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208, 
      27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909, 
      300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235, 
      108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008, 
      2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010, 
      2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464, 
      195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817, 
      300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc", 
      "fips_state_code", "mymonth", "price", "units", "year", "sales"
      ), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
      
      z<-structure(list(upc = c(1153801013, 72105922753, 72105922765, 
      81153801013, 81153801041, 1071917100, 8723610700), code = c(52161L, 
      52161L, 52161L, 52161L, 52161L, 50174L, 50174L)), .Names = c("upc", 
      "code"), row.names = c(3L, 1932L, 1934L, 2027L, 2033L, 2L, 1256L
      ), class = "data.frame")
      
      data1<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013, 
      1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1071917100, 
      1071917100, 1071917100, 1461503541, 1461503541, 1461503541, 11111111, 
      11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 
      1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2, 
      2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208, 
      27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909, 
      300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235, 
      108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008, 
      2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010, 
      2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464, 
      195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817, 
      300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc", 
      "fips_state_code", "mymonth", "price", "units", "year", "sales"
      ), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
      
      data2<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013, 
      1153801013, 1153801013, 1071917100, 1071917100, 1071917100, 1461503541, 
      1461503541, 1461503541, 11111111, 11111112), fips_state_code = c(2, 
      1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 
      1, 3, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.07844, 
      27.74, 27.99, 27.14448, 24.79, 300.07, 307.07, 269.99, 0.652, 
      13.99, 0.87, 3, 6), units = c(235, 121, 8, 4, 170, 3, 1, 2, 1, 
      8, 11, 1, 2, 5), year = c(2008, 2009, 2009, 2009, 2008, 2010, 
      2009, 2008, 2010, 2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 
      3397.491, 221.92, 111.96, 4614.561, 74.37, 300.07, 614.14, 269.99, 
      5.216, 153.89, 0.87, 6, 30)), .Names = c("upc", "fips_state_code", 
      "mymonth", "price", "units", "year", "sales"), row.names = c(NA, 
      14L), class = c("tbl_df", "data.frame"))
      

Here is what I have tried so far:

      w <- z[match(unique(z$code), z$code),]
      w <- plyr::rename(w,c("upc"="upc1"))
      data <- merge(x=data,y=z,by="upc",all.x=T,all.y=F)
      data <- merge(x=data,y=w,by="code",all.x=T,all.y=F)
      data <- within(data, upc2 <- ifelse(!is.na(upc1),upc1,upc))
      data$upc <- data$upc2
      data$upc1 <- data$upc2 <- data$code <- NULL
      data <- data[complete.cases(data),]
      attach(data)
      data <- aggregate(data,by=list(upc,fips_state_code,year,mymonth),FUN=sum)
      data$price <- data$sales / data$units
      detach(data)
      data$Group.1 <- data$Group.2 <- data$Group.3 <- data$Group.4 <- NULL
      

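I can't figure out how to make the chosen UPC the one with the highest sales. It would also be great if there were a way to do this in fewer lines of code and more elegantly.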
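
Since dplyr is mentioned, a shorter pipeline along these lines might work. It is a sketch rather than a tested answer: it assumes dplyr 1.0.0 or later (for the `.groups` argument of `summarise`), the helper column `upc_sales` is an invented name, and the sample rows are copied from the printed output, so sales are rounded as displayed.

```r
library(dplyr)

# Sample rows as printed in the question (sales rounded as displayed).
data <- read.table(header = TRUE, text = "
upc fips_state_code mymonth units year sales
1153801013  2 3 235 2008 6073.8563
1153801013  1 2 108 2009 3090.9396
1153801013  2 2   7 2009  195.9300
1153801013  1 1   4 2009  111.9600
1153801013  1 3   7 2008  195.9300
72105922753 1 3 163 2008 4418.6306
72105922765 2 2   3 2010   74.3700
72105922765 2 2   1 2009   25.9900
72105922765 1 2  13 2009  306.5518
1071917100  2 2   1 2009  300.0700
1071917100  1 3   2 2008  614.1400
1071917100  2 3   1 2010  269.9900
1461503541  2 2   8 2008    5.2160
1461503541  2 2  11 2010  153.8900
1461503541  1 1   1 2008    0.8700
11111111    1 1   2 2008    6.0000
11111112    1 1   5 2008   30.0000
")
z <- data.frame(
  upc = c(1153801013, 72105922753, 72105922765, 81153801013, 81153801041,
          1071917100, 8723610700),
  code = c(52161L, 52161L, 52161L, 52161L, 52161L, 50174L, 50174L)
)

data2 <- data %>%
  left_join(z, by = "upc") %>%
  # Total sales of each original UPC.
  group_by(upc) %>%
  mutate(upc_sales = sum(sales)) %>%
  # Within a code, switch to the UPC with the largest total sales;
  # rows without a code (NA) keep their own UPC.
  group_by(code) %>%
  mutate(upc = if_else(is.na(code), upc, upc[which.max(upc_sales)])) %>%
  # Collapse rows that share the key and recompute the weighted price.
  group_by(upc, fips_state_code, mymonth, year) %>%
  summarise(units = sum(units), sales = sum(sales), .groups = "drop") %>%
  mutate(price = sales / units)

print(data2)
```

The idea is to attach the code, compute each UPC's total sales, and let the top seller within each code replace the others before summarising.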

0 answers:

No answers