Merging duplicates in R

Date: 2018-08-16 23:34:41

Tags: r dataframe merge

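My data consists of part numbers and sale dates expressed as year, month, quarter, and day. The same part can be sold more than once on the same day under different invoice numbers, so a part number can appear several times for a given day. The data looks like this: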

Year <- c(2016, 2016, 2016, 2017, 2017, 2018, 2018)
Month <- c("Aug", "Sep", "Sep", "Aug", "Sep", "Aug", "Sep")
Day <- c(1, 2, 2, 1, 2, 1, 2)
Revenue <- c(147, 200, 250, 300, 200, 250, 150)
PartNumber <- c("1234", "5678", "5678", "1234", "5678", "5678", "9101")

testdf <- data.frame(Year, Month, Day, Revenue, PartNumber)
> testdf
  Year Month Day Revenue PartNumber
1 2016   Aug   1     147       1234
2 2016   Sep   2     200       5678
3 2016   Sep   2     250       5678
4 2017   Aug   1     300       1234
5 2017   Sep   2     200       5678
6 2018   Aug   1     250       5678
7 2018   Sep   2     150       9101

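What I have been doing is creating a copy of the data frame, adding one to the Year column, and renaming the Revenue column to RevenueLY (last year's revenue), like this: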

testdfCopy <- testdf
testdfCopy$Year <- testdfCopy$Year + 1
colnames(testdfCopy)[4] <- "RevenueLY"
mergeddf <- merge(testdf, testdfCopy, by = c("Year", "Month", "Day", "PartNumber"), all = TRUE)

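After merging them, I compare the total revenue of the first data frame with the total revenue of the merged data frame. The totals differ, of course, because the duplicated keys are matched more than once, so I am looking for a way to solve this. My actual data has millions of rows, so I hope we can find an approach that is neither manual nor time-consuming.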

> sum(testdf$Revenue)
[1] 1497
> sum(mergeddf$Revenue, na.rm = TRUE)
[1] 1697
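
The extra 200 comes from the key (2017, Sep, 2, 5678): it occurs once in testdf but twice in testdfCopy, so merge() returns both pairings and the 2017 revenue of 200 is counted twice. A minimal sketch, rebuilt from the sample data above, that lists the rows of the shifted copy whose merge key is not unique (the names keys and dupRows are illustrative):

```r
# Rebuild the sample data from the question
testdf <- data.frame(
  Year = c(2016, 2016, 2016, 2017, 2017, 2018, 2018),
  Month = c("Aug", "Sep", "Sep", "Aug", "Sep", "Aug", "Sep"),
  Day = c(1, 2, 2, 1, 2, 1, 2),
  Revenue = c(147, 200, 250, 300, 200, 250, 150),
  PartNumber = c("1234", "5678", "5678", "1234", "5678", "5678", "9101"),
  stringsAsFactors = FALSE
)

# Same copy as in the question: shift the year and rename the revenue column
testdfCopy <- testdf
testdfCopy$Year <- testdfCopy$Year + 1
colnames(testdfCopy)[4] <- "RevenueLY"

keys <- c("Year", "Month", "Day", "PartNumber")

# Flag every row whose key appears more than once (first and later occurrences)
isDup <- duplicated(testdfCopy[keys]) | duplicated(testdfCopy[keys], fromLast = TRUE)
dupRows <- testdfCopy[isDup, ]
print(dupRows)
```

Any key listed here multiplies the matching rows of testdf in the merge, which is why the revenue total grows.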

In the end I get mergeddf:

> mergeddf
   Year Month Day PartNumber Revenue RevenueLY
1  2016   Aug   1       1234     147        NA
2  2016   Sep   2       5678     200        NA
3  2016   Sep   2       5678     250        NA
4  2017   Aug   1       1234     300       147
5  2017   Sep   2       5678     200       200
6  2017   Sep   2       5678     200       250
7  2018   Aug   1       1234      NA       300
8  2018   Aug   1       5678     250        NA
9  2018   Sep   2       5678      NA       200
10 2018   Sep   2       9101     150        NA
11 2019   Aug   1       5678      NA       250
12 2019   Sep   2       9101      NA       150

But what I want is:

> finaldf
  Year Month Day Revenue PartNumber RevenueLY
1 2016   Aug   1     147       1234        NA
2 2016   Sep   2     200       5678        NA
3 2016   Sep   2     250       5678        NA
4 2017   Aug   1     300       1234       147
5 2017   Sep   2     200       5678       200
6 2018   Aug   1     250       5678        NA
7 2018   Sep   2     150       9101        NA

2 answers:

Answer 0 (score: 0)

Here is a possible option with dplyr (create an index for joining the tables and use left_join):

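    library(dplyr)

    # Build a key per sale and the key the same sale would have one year later
    testdf <- testdf %>%
      mutate(ind = paste(Year, Month, Day, PartNumber, sep = "_"),
             ind_next = paste(Year + 1, Month, Day, PartNumber, sep = "_"))

    # Last year's revenue comes from the row whose ind_next equals this row's ind
    testdf %>%
      left_join(select(testdf, ind_next, RevenueLY = Revenue),
                by = c("ind" = "ind_next"))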
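
Joining on a pasted key alone still pairs every duplicate with every duplicate. One way to get the one-to-one pairing shown in finaldf is to add an occurrence counter to the join key. The sketch below assumes the intended rule is that the n-th sale of a part on a date is paired with the n-th sale of that part on the same date one year earlier; the column name occ and the helper tables numbered and lastYear are hypothetical:

```r
library(dplyr)

# Sample data from the question
testdf <- data.frame(
  Year = c(2016, 2016, 2016, 2017, 2017, 2018, 2018),
  Month = c("Aug", "Sep", "Sep", "Aug", "Sep", "Aug", "Sep"),
  Day = c(1, 2, 2, 1, 2, 1, 2),
  Revenue = c(147, 200, 250, 300, 200, 250, 150),
  PartNumber = c("1234", "5678", "5678", "1234", "5678", "5678", "9101"),
  stringsAsFactors = FALSE
)

# Number repeated sales of the same part on the same date: 1, 2, ...
numbered <- testdf %>%
  group_by(Year, Month, Day, PartNumber) %>%
  mutate(occ = row_number()) %>%
  ungroup()

# Shift a copy forward one year so its revenue lines up as "last year"
lastYear <- numbered %>%
  mutate(Year = Year + 1) %>%
  select(Year, Month, Day, PartNumber, occ, RevenueLY = Revenue)

# Each (date, part, occurrence) key is now unique on both sides of the join
finaldf <- numbered %>%
  left_join(lastYear, by = c("Year", "Month", "Day", "PartNumber", "occ")) %>%
  select(-occ)

print(finaldf)
```

Because the key is unique in both tables, the join cannot add rows, so sum(finaldf$Revenue) stays equal to sum(testdf$Revenue).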

Answer 1 (score: 0)

Based on our discussion in the comments, I think you are looking for this:

    # use data.table
    library(data.table)
    setDT(testdf)

    # create an ordernum so that the revenue from the first sale of part A in
    # month M and year Y will be matched to the first sale of part A in month
    # M and year Y+1, as requested by the OP
    testdf[, ordernum := seq_len(.N), by = .(Year, Month, PartNumber)]

    # use your approach of copy, adjust year, rename revenue
    testdfCopy <- copy(testdf)
    testdfCopy[, Year := Year + 1]
    testdfCopy[, RevenueLY := Revenue]

    # merge
    mergeddf <- merge(testdf,
                      testdfCopy[, .(Year, Month, ordernum, PartNumber, RevenueLY)],
                      by = c("Year", "Month", "PartNumber", "ordernum"),
                      all.x = TRUE)
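
A quick check, sketched on the question's sample data, that this merge keeps one row per sale and leaves the revenue total unchanged. The two stopifnot() checks and the final step that drops the helper column are additions for illustration, not part of the answer above:

```r
library(data.table)

# Sample data from the question, built directly as a data.table
testdf <- data.table(
  Year = c(2016, 2016, 2016, 2017, 2017, 2018, 2018),
  Month = c("Aug", "Sep", "Sep", "Aug", "Sep", "Aug", "Sep"),
  Day = c(1, 2, 2, 1, 2, 1, 2),
  Revenue = c(147, 200, 250, 300, 200, 250, 150),
  PartNumber = c("1234", "5678", "5678", "1234", "5678", "5678", "9101")
)

# Occurrence counter within year, month, and part, as in the answer
testdf[, ordernum := seq_len(.N), by = .(Year, Month, PartNumber)]

# Shifted copy carrying last year's revenue
testdfCopy <- copy(testdf)
testdfCopy[, Year := Year + 1]
testdfCopy[, RevenueLY := Revenue]

mergeddf <- merge(testdf,
                  testdfCopy[, .(Year, Month, ordernum, PartNumber, RevenueLY)],
                  by = c("Year", "Month", "PartNumber", "ordernum"),
                  all.x = TRUE)

# The merge should neither add rows nor change the revenue total
stopifnot(nrow(mergeddf) == nrow(testdf))
stopifnot(sum(mergeddf$Revenue) == sum(testdf$Revenue))

# Drop the helper column once the merge is done
mergeddf[, ordernum := NULL]
print(mergeddf)
```

On millions of rows the same two checks are a cheap way to confirm that no key was matched more than once.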