I have two data frames: df_workingFile and df_groupIDs
df_workingFile:
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006
df_groupIDs:
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2
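For df_groupIDs, I want to get the ID and Date of the event with the highest sales in each group. For example, group "a1" has 2 events in df_workingFile, "v" and "w". I want to identify that event "w" has the max sales and bring its information into df_groupIDs. The final output should look like this: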
GroupID | numIDs | MaxSales | ID | Date
a1 | 2 | 3 | w | 2010
b1 | 2 | 8 | x | 2007
c3 | 1 | 2 | z | 2006
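Now the problem. I wrote code that does this, but it is very inefficient and takes forever to run on a dataset of 50-100K rows. I need help figuring out how to rewrite my code to be more efficient. This is what I currently have: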
i <- 1
for (groupID in df_groupIDs$GroupID) {
  # all events belonging to the current group
  groupEvents <- subset(df_workingFile, df_workingFile$GroupID == groupID)
  # position of the first event whose Sales equals the group's MaxSales
  index <- match(df_groupIDs$MaxSales[i], groupEvents$Sales)
  df_groupIDs$ID[i] <- groupEvents$ID[index]
  df_groupIDs$Date[i] <- groupEvents$Date[index]
  i <- i + 1
}
Answer 0 (score: 4)
Using dplyr:
library(dplyr)
df_workingFile %>%
  group_by(GroupID) %>%                          # for each group id
  arrange(desc(Sales)) %>%                       # sort by Sales (descending)
  slice(1) %>%                                   # keep the top row
  inner_join(df_groupIDs, by = "GroupID") %>%    # join to df_groupIDs
  select(GroupID, numIDs, MaxSales, ID, Date)    # keep the columns you want, in the order you want
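As a possible variation on the pipeline above, here is a minimal self-contained sketch that assumes dplyr 1.0.0 or later, where slice_max() can stand in for the arrange() plus slice(1) pair. The sample data is copied from the question; the name result is only illustrative.

```r
library(dplyr)

# Sample data copied from the question
df_workingFile <- data.frame(
  ID      = c("v", "w", "x", "y", "z"),
  GroupID = c("a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(2, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

result <- df_workingFile %>%
  group_by(GroupID) %>%
  # keep exactly one row per group: the one with the largest Sales
  slice_max(Sales, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  inner_join(df_groupIDs, by = "GroupID") %>%
  select(GroupID, numIDs, MaxSales, ID, Date)

print(result)
```

Setting with_ties = FALSE keeps a single row even when several events in a group share the maximum, which matches what slice(1) does in the original pipeline.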
Another, simpler approach, if Sales is an integer (so that an equality test against the MaxSales column can be relied on):
inner_join(df_groupIDs, df_workingFile,
by = c("GroupID" = "GroupID", "MaxSales" = "Sales"))
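One thing to keep in mind with an equality join of this kind is that it returns one row per matching event, so a group whose maximum is shared by several events appears more than once. The sketch below illustrates this with base R's merge() as a stand-in for inner_join(). The extra event "u" (tied with "w") and the adjusted numIDs are hypothetical additions to the question's data, and keeping the first row per group is just one possible way to break ties.

```r
# Question's data plus a hypothetical tied event "u" in group a1
df_workingFile <- data.frame(
  ID      = c("v", "w", "u", "x", "y", "z"),
  GroupID = c("a1", "a1", "a1", "b1", "b1", "c3"),
  Sales   = c(1, 3, 3, 8, 3, 2),
  Date    = c(2011, 2010, 2009, 2007, 2006, 2006),
  stringsAsFactors = FALSE
)
df_groupIDs <- data.frame(
  GroupID  = c("a1", "b1", "c3"),
  numIDs   = c(3, 2, 1),
  MaxSales = c(3, 8, 2),
  stringsAsFactors = FALSE
)

# Equality join on (GroupID, MaxSales == Sales)
joined <- merge(df_groupIDs, df_workingFile,
                by.x = c("GroupID", "MaxSales"),
                by.y = c("GroupID", "Sales"))
nrow(joined)  # 4 rows: group a1 matches both "w" and "u"

# Keep only the first matching row per group and restore the column order
deduped <- joined[!duplicated(joined$GroupID),
                  c("GroupID", "numIDs", "MaxSales", "ID", "Date")]
print(deduped)
```

If the data can never contain ties, the join alone is enough and the de-duplication step can be dropped.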
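Answer 1 (score: 1)
This relies on an SQLite feature: when a query uses max(), the other (non-aggregated) columns in the select list are taken from the same row that the maximum value came from.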
library(sqldf)
sqldf("select g.GroupID, g.numIDs, max(w.Sales) MaxSales, w.ID, w.Date
from df_groupIDs g left join df_workingFile w using(GroupID)
group by GroupID")
giving:
GroupID numIDs MaxSales ID Date
1 a1 2 3 w 2010
2 b1 2 8 x 2007
3 c3 1 2 z 2006
Note: the two input data frames, shown in reproducible form, are:
Lines1 <- "
ID | GroupID | Sales | Date
v | a1 | 1 | 2011
w | a1 | 3 | 2010
x | b1 | 8 | 2007
y | b1 | 3 | 2006
z | c3 | 2 | 2006"
df_workingFile <- read.table(text = Lines1, header = TRUE, sep = "|", strip.white = TRUE)
Lines2 <- "
GroupID | numIDs | MaxSales
a1 | 2 | 3
b1 | 2 | 8
c3 | 1 | 2"
df_groupIDs <- read.table(text = Lines2, header = TRUE, sep = "|", strip.white = TRUE)