Question

I have two data frames: df_workingFile and df_groupIDs.

df_workingFile:

ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006

df_groupIDs:

GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2

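For df_groupIDs, I want to get the ID and Date of the event with the highest sales in each group. So group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max sales and bring its information into df_groupIDs. The final output should look like this: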

GroupID | numIDs  | MaxSales | ID | Date
a1      | 2       |  3       | w  | 2010
b1      | 2       |  8       | x  | 2007
c3      | 1       |  2       | z  | 2006

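Now the problem. I wrote code that does this, but it is very inefficient and takes forever to run on data sets of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. Here is what I currently have: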

i = 1
for (groupID in df_groupIDs$GroupID) {

    # all events that belong to the current group
    groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
    # position of the first event whose Sales equals the group's MaxSales
    index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
    df_groupIDs$ID[i] = groupEvents$ID[index]
    df_groupIDs$Date[i] = groupEvents$Date[index]

    i = i+1
}

Answer 1

Using dplyr:

library(dplyr)

df_workingFile %>%
  group_by(GroupID) %>%                        # for each group id
  arrange(desc(Sales)) %>%                     # sort by Sales (descending)
  slice(1) %>%                                 # keep the top row
  inner_join(df_groupIDs, by = "GroupID") %>%  # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)  # keep the columns you want in the order you want
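
As a self-contained sketch of the same idea, assuming dplyr 1.0.0 or later (where slice_max() is available), the sort and slice steps can be collapsed into a single call. The sample data frames are rebuilt inline from the question, and with_ties = FALSE is an assumption that exactly one row per group is wanted when several events share the maximum:

```r
library(dplyr)

# Sample data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

result <- df_workingFile %>%
  group_by(GroupID) %>%
  slice_max(Sales, n = 1, with_ties = FALSE) %>%  # top-selling row per group (assumption: drop ties)
  ungroup() %>%
  inner_join(df_groupIDs, by = "GroupID") %>%
  select(GroupID, numIDs, MaxSales, ID, Date)

print(result)
```

Because the work is done once per group inside dplyr rather than by subsetting the whole data frame on every loop iteration, this should scale much better than the original loop.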

Another, simpler approach, if Sales is an integer (so that an equality test against the MaxSales column can be relied on):

inner_join(df_groupIDs, df_workingFile,
           by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
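
One caveat with joining on the value itself: if two events in a group share the maximum, the join returns one row for each tied event. Below is a base R sketch of the same join using merge(), followed by a de-duplication step. Keeping only the first tied row is an assumption, since the question does not say how ties should be handled:

```r
# Sample data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Join on group and on MaxSales == Sales
joined <- merge(df_groupIDs, df_workingFile,
                by.x = c("GroupID", "MaxSales"),
                by.y = c("GroupID", "Sales"))

# Assumption: keep only the first matching event per group if there are ties
joined <- joined[!duplicated(joined$GroupID), ]

# Restore the column order requested in the question
joined <- joined[, c("GroupID", "numIDs", "MaxSales", "ID", "Date")]
print(joined)
```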

Answer 2

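This relies on a feature of SQLite: when a query uses the max aggregate, the other selected columns are taken from the row in which the maximum occurs.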

library(sqldf)

sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date 
       from df_groupIDs g left join df_workingFile w using(GroupID) 
       group by GroupID")

giving:

  GroupID numIDs MaxSales ID Date
1      a1      2        3  w 2010
2      b1      2        8  x 2007
3      c3      1        2  z 2006

Note: The two input data frames, in reproducible form, are:

Lines1 <- "
ID | GroupID | Sales | Date
v  | a1      |  1    |  2011
w  | a1      |  3    |  2010
x  | b1      |  8    |  2007
y  | b1      |  3    |  2006
z  | c3      |  2    |  2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)

Lines2 <- "
GroupID | numIDs  | MaxSales 
a1      | 2       |  3       
b1      | 2       |  8       
c3      | 1       |  2"      

df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)

Fast matching of data between two data frames [R]
