I have two data frames, and I want to use one of them as a reference for combining rows in the other.
First, I have data:
> data
           upc fips_state_code mymonth     price units year     sales
1   1153801013               2       3  25.84620   235 2008 6073.8563
2   1153801013               1       2  28.61981   108 2009 3090.9396
3   1153801013               2       2  27.99000     7 2009  195.9300
4   1153801013               1       1  27.99000     4 2009  111.9600
5   1153801013               1       3  27.99000     7 2008  195.9300
6  72105922753               1       3  27.10816   163 2008 4418.6306
7  72105922765               2       2  24.79000     3 2010   74.3700
8  72105922765               2       2  25.99000     1 2009   25.9900
9  72105922765               1       2  23.58091    13 2009  306.5518
10  1071917100               2       2 300.07000     1 2009  300.0700
11  1071917100               1       3 307.07000     2 2008  614.1400
12  1071917100               2       3 269.99000     1 2010  269.9900
13  1461503541               2       2   0.65200     8 2008    5.2160
14  1461503541               2       2  13.99000    11 2010  153.8900
15  1461503541               1       1   0.87000     1 2008    0.8700
16    11111111               1       1   3.00000     2 2008    6.0000
17    11111112               1       1   6.00000     5 2008   30.0000
Then I have z, which is the reference:
> z
             upc  code
3     1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2     1071917100 50174
1256  8723610700 50174
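I want to combine the observations in data whose upc values share the same code in z.
In the sample I have given, there are 7 distinct UPCs.
1071917100 is in z, with code 50174. However, the only other upc with this code is 8723610700, which is not in data. Therefore, it stays unchanged.
1461503541, 11111111 and 11111112 are not in z at all, so they also stay unchanged.
1153801013, 72105922753 and 72105922765 share the same code in z, 52161. Therefore, I want to combine all observations with these upc values.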
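The classification above can be checked with a short base R sketch. The vector `data_upc` and the names `code_of`, `groups` and `to_merge` are only illustrative; the values are copied from the sample.

```r
# UPCs that appear in the sample data, and the reference table z (values from the post)
data_upc <- c(1153801013, 72105922753, 72105922765, 1071917100,
              1461503541, 11111111, 11111112)
z <- data.frame(
  upc  = c(1153801013, 72105922753, 72105922765, 81153801013,
           81153801041, 1071917100, 8723610700),
  code = c(52161L, 52161L, 52161L, 52161L, 52161L, 50174L, 50174L)
)

# Look up the code of each data UPC; UPCs missing from z get NA
code_of <- z$code[match(data_upc, z$upc)]

# Group the data UPCs by code (split() drops the NA group)
groups <- split(data_upc, code_of)

# Only codes with more than one UPC present in data need combining
to_merge <- groups[lengths(groups) > 1]
print(to_merge)
```

Only code 52161 should remain in `to_merge`, since 1071917100 is the sole UPC in data with code 50174.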
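I want to do this in a very specific way:
First, I want to pick the UPC with the largest sales in data. 1153801013 has sales of 9668.616 (simply the sum of all sales with that upc). 72105922753 has sales of 4418.631. 72105922765 has sales of 406.9118. Therefore, I choose 1153801013 as the upc for all of them.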
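The totals quoted above can be reproduced with a minimal base R sketch. The data frame `sales_df` is a hypothetical cut-down copy of data containing only the three UPCs with code 52161.

```r
# Sales rows for the three UPCs that share code 52161 (values from the post)
sales_df <- data.frame(
  upc   = c(1153801013, 1153801013, 1153801013, 1153801013, 1153801013,
            72105922753, 72105922765, 72105922765, 72105922765),
  sales = c(6073.8563380285, 3090.9396226464, 195.93, 111.96, 195.93,
            4418.6306122439, 74.37, 25.99, 306.5518181817)
)

# Total sales per UPC
totals <- aggregate(sales ~ upc, data = sales_df, FUN = sum)
print(totals)

# The UPC with the largest total becomes the representative of the group
best_upc <- totals$upc[which.max(totals$sales)]
print(format(best_upc, scientific = FALSE))
```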
Now that this upc is chosen, I want to change 72105922753 and 72105922765 to 1153801013 in data.
We then have a dataset that looks like this:
> data1
           upc fips_state_code mymonth     price units year     sales
1   1153801013               2       3  25.84620   235 2008 6073.8563
2   1153801013               1       2  28.61981   108 2009 3090.9396
3   1153801013               2       2  27.99000     7 2009  195.9300
4   1153801013               1       1  27.99000     4 2009  111.9600
5   1153801013               1       3  27.99000     7 2008  195.9300
6   1153801013               1       3  27.10816   163 2008 4418.6306
7   1153801013               2       2  24.79000     3 2010   74.3700
8   1153801013               2       2  25.99000     1 2009   25.9900
9   1153801013               1       2  23.58091    13 2009  306.5518
10  1071917100               2       2 300.07000     1 2009  300.0700
11  1071917100               1       3 307.07000     2 2008  614.1400
12  1071917100               2       3 269.99000     1 2010  269.9900
13  1461503541               2       2   0.65200     8 2008    5.2160
14  1461503541               2       2  13.99000    11 2010  153.8900
15  1461503541               1       1   0.87000     1 2008    0.8700
16    11111111               1       1   3.00000     2 2008    6.0000
17    11111112               1       1   6.00000     5 2008   30.0000
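The relabelling step that turns data into data1 can be sketched with `match()`. The vectors `old_upc` and `new_upc` are a hypothetical lookup written by hand here; in practice they would come from the sales comparison.

```r
# A few upc values from the post, before relabelling
upc <- c(1153801013, 72105922753, 72105922765, 1071917100, 11111111)

# Hypothetical lookup: each upc to be replaced, and its replacement
old_upc <- c(72105922753, 72105922765)
new_upc <- c(1153801013, 1153801013)

# match() gives the position in old_upc, or NA when no replacement applies
idx <- match(upc, old_upc)

# Keep the original upc where there is no replacement
upc1 <- ifelse(is.na(idx), upc, new_upc[idx])
print(format(upc1, scientific = FALSE))
```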
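Rows that now share the same upc, fips_state_code, mymonth and year are then collapsed into one: units and sales are summed, and price is recomputed as sales / units. Therefore, the final dataset should look like this: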
> data2
           upc fips_state_code mymonth     price units year    sales
1   1153801013               2       3  25.84620   235 2008 6073.856
2   1153801013               1       2  28.07844   121 2009 3397.491
3   1153801013               2       2  27.74000     8 2009  221.920
4   1153801013               1       1  27.99000     4 2009  111.960
5   1153801013               1       3  27.14448   170 2008 4614.561
6   1153801013               2       2  24.79000     3 2010   74.370
7   1071917100               2       2 300.07000     1 2009  300.070
8   1071917100               1       3 307.07000     2 2008  614.140
9   1071917100               2       3 269.99000     1 2010  269.990
10  1461503541               2       2   0.65200     8 2008    5.216
11  1461503541               2       2  13.99000    11 2010  153.890
12  1461503541               1       1   0.87000     1 2008    0.870
13    11111111               1       1   3.00000     2 2008    6.000
14    11111112               1       1   6.00000     5 2008   30.000
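I have tried to do this myself, but it seems this could be done more efficiently than my code by using dplyr, and I was not able to get the last part to work. Please let me know if anything is unclear, and thank you very much in advance.
Here is the dput code: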
data<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 72105922753, 72105922765, 72105922765, 72105922765,
1071917100, 1071917100, 1071917100, 1461503541, 1461503541, 1461503541,
11111111, 11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2,
2, 1, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3,
2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
z<-structure(list(upc = c(1153801013, 72105922753, 72105922765,
81153801013, 81153801041, 1071917100, 8723610700), code = c(52161L,
52161L, 52161L, 52161L, 52161L, 50174L, 50174L)), .Names = c("upc",
"code"), row.names = c(3L, 1932L, 1934L, 2027L, 2033L, 2L, 1256L
), class = "data.frame")
data1<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1071917100,
1071917100, 1071917100, 1461503541, 1461503541, 1461503541, 11111111,
11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2,
1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2,
2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
data2<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1071917100, 1071917100, 1071917100, 1461503541,
1461503541, 1461503541, 11111111, 11111112), fips_state_code = c(2,
1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2,
1, 3, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.07844,
27.74, 27.99, 27.14448, 24.79, 300.07, 307.07, 269.99, 0.652,
13.99, 0.87, 3, 6), units = c(235, 121, 8, 4, 170, 3, 1, 2, 1,
8, 11, 1, 2, 5), year = c(2008, 2009, 2009, 2009, 2008, 2010,
2009, 2008, 2010, 2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285,
3397.491, 221.92, 111.96, 4614.561, 74.37, 300.07, 614.14, 269.99,
5.216, 153.89, 0.87, 6, 30)), .Names = c("upc", "fips_state_code",
"mymonth", "price", "units", "year", "sales"), row.names = c(NA,
14L), class = c("tbl_df", "data.frame"))
Here is what I have tried so far:
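# Keep the first row of z for each code; its upc becomes the replacement (upc1)
w <- z[match(unique(z$code), z$code),]
w <- plyr::rename(w,c("upc"="upc1"))
# Attach each row's code, then the replacement upc for that code
data <- merge(x=data,y=z,by="upc",all.x=TRUE,all.y=FALSE)
data <- merge(x=data,y=w,by="code",all.x=TRUE,all.y=FALSE)
# Use the replacement upc where one exists, otherwise keep the original
data <- within(data, upc2 <- ifelse(!is.na(upc1),upc1,upc))
data$upc <- data$upc2
data$upc1 <- data$upc2 <- data$code <- NULL
data <- data[complete.cases(data),]
# Sum within each upc/state/year/month group, then recompute price
# Caution: FUN=sum is applied to every column, including upc and the grouping columns
attach(data)
data <- aggregate(data,by=list(upc,fips_state_code,year,mymonth),FUN=sum)
data$price <- data$sales / data$units
detach(data)
data$Group.1 <- data$Group.2 <- data$Group.3 <- data$Group.4 <- NULL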
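I can't figure out how to make the chosen upc the one with the highest sales. If there is a way to do this in fewer lines of code and more elegantly, that would be great too.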
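One possible approach is sketched below with dplyr. It is not a definitive solution. It assumes dplyr 1.0 or later (for the `.groups` argument), that z has at most one row per upc, and that price should be recomputed as sales / units after combining. The names `lookup`, `total`, `new_upc` and `result` are made up for illustration.

```r
library(dplyr)

# Sample data and reference table, copied from the dput output in the post
data <- data.frame(
  upc = c(rep(1153801013, 5), 72105922753, rep(72105922765, 3),
          rep(1071917100, 3), rep(1461503541, 3), 11111111, 11111112),
  fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1),
  mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1),
  price = c(25.8461971831, 28.6198113208, 27.99, 27.99, 27.99,
            27.1081632653, 24.79, 25.99, 23.5809090909, 300.07,
            307.07, 269.99, 0.652, 13.99, 0.87, 3, 6),
  units = c(235, 108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5),
  year = c(2008, 2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009,
           2008, 2010, 2008, 2010, 2008, 2008, 2008),
  sales = c(6073.8563380285, 3090.9396226464, 195.93, 111.96, 195.93,
            4418.6306122439, 74.37, 25.99, 306.5518181817, 300.07,
            614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)
)
z <- data.frame(
  upc  = c(1153801013, 72105922753, 72105922765, 81153801013,
           81153801041, 1071917100, 8723610700),
  code = c(52161L, 52161L, 52161L, 52161L, 52161L, 50174L, 50174L)
)

# For every upc that is in both data and z, find the best-selling upc of its code
lookup <- data %>%
  group_by(upc) %>%
  summarise(total = sum(sales), .groups = "drop") %>%
  inner_join(z, by = "upc") %>%
  group_by(code) %>%
  mutate(new_upc = upc[which.max(total)]) %>%
  ungroup() %>%
  select(upc, new_upc)

# Relabel, then collapse rows that share upc, state, month and year
result <- data %>%
  left_join(lookup, by = "upc") %>%
  mutate(upc = coalesce(new_upc, upc)) %>%   # upcs absent from z keep their value
  group_by(upc, fips_state_code, mymonth, year) %>%
  summarise(units = sum(units), sales = sum(sales), .groups = "drop") %>%
  mutate(price = sales / units)

print(result)
```

The rows of `result` come out sorted by the grouping columns, so the order differs from data2 even though the contents should match. Joining through `lookup` rather than grouping directly on code avoids lumping together the UPCs that have no code in z.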