Fast matching of data between two data frames [R]

Date: 2017-08-04 23:41:39

Tags: r

I have two data frames: df_workingFile and df_groupIDs

df_workingFile:

ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006

df_groupIDs:

GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2      

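For df_groupIDs, I want to get the ID and Date of the event with the highest sales in each group. For example, group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the maximum sales and bring its information into df_groupIDs. The final output should look like this: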

GroupID | numIDs  | MaxSales | ID | Date
a1      | 2       |  3       | w  | 2010
b1      | 2       |  8       | x  | 2007
c3      | 1       |  2       | z  | 2006

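Now the problem. I have written code that does this, but it is very inefficient and takes forever to run when I work with datasets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. This is what I have so far: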

i = 1
for (groupID in df_groupIDs$GroupID) {

    # rows of df_workingFile that belong to the current group
    groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
    # position of the first event whose Sales equals the group's MaxSales
    index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
    df_groupIDs$ID[i] = groupEvents$ID[index]
    df_groupIDs$Date[i] = groupEvents$Date[index]

    i = i+1
}

2 answers:

Answer 0: (score: 4)

Using dplyr

library(dplyr)

df_workingFile %>% 
  group_by(GroupID) %>%      # for each group id
  arrange(desc(Sales)) %>%   # sort by Sales (descending)
  slice(1) %>%               # keep the top row
  inner_join(df_groupIDs, by = "GroupID") %>%   # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)
    # keep the columns you want in the order you want
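
For readers who do not have dplyr available, the same sort-then-take-first idea can be sketched in base R. This is only an illustration under the assumption that the data look like the small example in the question; the intermediate names sorted, top and result are made up for this sketch and do not appear in the original code.

```r
# Sample data copied from the question (assumed to be representative)
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Sort all events by Sales, largest first
sorted <- df_workingFile[order(-df_workingFile$Sales), ]

# Keep only the first (highest-selling) row seen for each GroupID
top <- sorted[!duplicated(sorted$GroupID), c("GroupID", "ID", "Date")]

# Attach ID and Date to the per-group summary
result <- merge(df_groupIDs, top, by = "GroupID")
print(result)
```

Because the sort and the duplicate check are each done once over the whole data frame, this avoids re-subsetting df_workingFile for every group, which is what the loop in the question does.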

Another, simpler approach, if Sales is an integer (so that an equality test against the MaxSales column can be relied upon):

inner_join(df_groupIDs, df_workingFile,
           by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
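
The same two-key join can also be sketched with base R's merge, in case dplyr is not an option. This is a minimal sketch that assumes the sample data from the question; the name result is introduced here for illustration only.

```r
# Sample data copied from the question (assumed to be representative)
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Match on the group AND on MaxSales == Sales, so only the
# event(s) that reach the group's maximum survive the join
result <- merge(df_groupIDs, df_workingFile,
                by.x = c("GroupID", "MaxSales"),
                by.y = c("GroupID", "Sales"))

# Put the columns in the order shown in the question
result <- result[, c("GroupID", "numIDs", "MaxSales", "ID", "Date")]
print(result)
```

One thing to keep in mind with either form of this join: if two events in the same group share the maximum Sales value, both rows are returned, so the result can have more than one row per group.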

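Answer 1: (score: 1)

This uses a feature of SQLite: if max is used in a grouped query, the other selected columns are automatically taken from the row that the maximum came from.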

library(sqldf)

sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date 
       from df_groupIDs g left join df_workingFile w using(GroupID) 
       group by GroupID")

giving:

  GroupID numIDs MaxSales ID Date
1      a1      2        3  w 2010
2      b1      2        8  x 2007
3      c3      1        2  z 2006

Note: The two input data frames, shown in reproducible form, are:

Lines1 <- "
ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)

Lines2 <- "
GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2"      

df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)