我一直在努力寻找解决这个问题的最佳方法。
为了概括这个问题并帮助其他可能发现自己需要执行类似任务的人,我试图找到将列添加到第三个数据集的最佳方法,即基于中间数据集中的匹配,AND属于第三数据集的日期范围。最终结果是将第三个数据集中的匹配值返回到第一个数据集。
以下是样本数据框的头部,以增加一点清晰度:
> head(SalesData, 10)
sale_id sale_amt int_rate sale_date sale_status
1 1 7000 10.71 2008-05-01 Fully Paid
2 2 10800 13.57 2009-11-01 Fully Paid
3 3 7500 10.08 2008-04-01 Fully Paid
4 4 3000 14.26 2009-09-01 Fully Paid
5 5 5600 14.96 2010-02-01 Charged Off
6 6 2800 11.49 2010-08-01 Fully Paid
7 7 10000 8.59 2009-10-01 Fully Paid
8 8 18000 10.39 2008-03-01 Fully Paid
9 9 5000 15.13 2008-04-01 Fully Paid
10 10 9600 12.29 2008-03-01 Fully Paid
> head(EmployeeSales, 10)
sale_id empl_name empl_num
1 1 Dakota 4
2 2 Dakota 4
3 3 Kami 9
4 4 Adel 1
5 5 Adel 1
6 6 Farah 6
7 7 Kami 9
8 8 Kami 9
9 9 Ida 7
10 10 Kami 9
> head(EmployeeMap, 10)
empl_num empl_name skill_lvl team start_date end_date
1 1 Adel Beg Red 2007-06-01 2008-05-31
2 1 Adel Int Red 2008-06-01 2010-10-31
3 1 Adel Adv Red 2010-11-01 2999-12-12
4 2 Bailey Beg Blue 2010-08-01 2011-04-30
5 2 Bailey Beg Red 2011-05-01 2999-12-12
6 3 Casey Beg Blue 2010-08-01 2010-12-31
7 3 Casey Int Blue 2011-01-01 2999-12-12
8 4 Dakota Beg Red 2007-06-01 2009-08-30
9 4 Dakota Int Red 2009-09-01 2010-08-30
10 4 Dakota Adv Red 2010-09-01 2011-08-30
所需的输出会将EmployeeMap中的empl_num,sales_team和skill_level添加到每个sale_id的SalesData中。
在尝试概念化步骤时,这就是我的想法,但也许有更好的方法: 从SalesData获取sale_id,将其与Employee Sales中的sale_id匹配并获取empl_num。获取empl_num并将其与Employee Map中的empl_num匹配。现在我们需要从SalesData获取sale_date并找到" start_date,end_date"的范围。它陷入了。然后我们将获取匹配的团队和技能级别,并将其添加到SalesData。
见下表:
> head(df2,10)
sale_id sale_amt int_rate sale_date sale_status empl_num team skill_lvl
1 1 7000 10.71 2008-05-01 Fully Paid 4 Red Beg
2 2 10800 13.57 2009-11-01 Fully Paid 4 Red Int
3 3 7500 10.08 2008-04-01 Fully Paid 9 Blue Beg
4 4 3000 14.26 2009-09-01 Fully Paid 1 Red Int
5 5 5600 14.96 2010-02-01 Charged Off 1 Red Int
6 6 2800 11.49 2010-08-01 Fully Paid 6 Red Beg
7 7 10000 8.59 2009-10-01 Fully Paid 9 Blue Int
8 8 18000 10.39 2008-03-01 Fully Paid 9 Blue Beg
9 9 5000 15.13 2008-04-01 Fully Paid 7 Blue Beg
10 10 9600 12.29 2008-03-01 Fully Paid 9 Blue Int
让我感到困惑的是,在EmployeeMap中,start_date和end_date告诉我们每个员工开始和结束属于特定技能级别和团队的日期。但是每个员工都改变了技能水平和/或团队,因此每个员工都有多行。
例如,在empl_id 1的EmployeeMap中,我们可以看到3行告诉我们他们的start_date和end_date,而他们在Red Team上有一个skill_level Beg,Int,Adv all。但有些人,比如empl_id 2,在保持相同技能水平的同时改变团队。而其他人则会改变技能水平和团队。
如果您有解决此问题的最佳方法,我将不胜感激。
答案 0 :(得分:1)
也许最简单的方法就是使用两个类似SQL的连接(如果你不熟悉连接/关系代数,我建议你给like this一些东西。)
可以使用基础R中的merge
函数执行许多联接,还有许多其他常用软件包(dplyr
,data.table
,sqldf
,仅举几例)联接操作中的替代语法或扩展功能。
您可以使用SalesData
轻松完成两个联接中的第一个(EmployeeSales
和merge
之间):
merge(SalesData, EmployeeSales, by = "sale_id")
# sale_id sale_amt int_rate sale_date sale_status empl_name empl_num
# 1 1 7000 10.71 2008-05-01 Fully Paid Dakota 4
# 2 2 10800 13.57 2009-11-01 Fully Paid Dakota 4
# 3 3 7500 10.08 2008-04-01 Fully Paid Kami 9
# ...
然而,第二次加入更复杂,因为它不是典型的equi-join。相反,连接逻辑需要查找EmployeeMap
中start_date
小于sale_date
且end date
大于empl_num
的行(除{{1}上的相等条件外}})。
幸运的是,前面提到的data.table
包提供了应用所述逻辑的能力。
library(data.table)
# convert all three dataframes to data.table objects
setDT(SalesData) ; setDT(EmployeeSales) ; setDT(EmployeeMap)
EmployeeMap[SalesData[EmployeeSales[, c("sale_id","empl_num")],
on = "sale_id"],
on = .(empl_num, start_date <= sale_date, end_date >= sale_date)]
# empl_num empl_name skill_lvl team start_date end_date sale_id sale_amt int_rate sale_status
# 1: 4 Dakota Beg Red 2008-05-01 2008-05-01 1 7000 10.71 Fully Paid
# 2: 4 Dakota Int Red 2009-11-01 2009-11-01 2 10800 13.57 Fully Paid
# 3: 9 NA NA NA 2008-04-01 2008-04-01 3 7500 10.08 Fully Paid
# ...
请注意,所有三个日期列都应该是日期类型,而不是字符串,以便进行比较。另请注意,上面输出中的NA
值是问题中提供的EmployeeMap
快照的结果,该快照仅映射empl_num
1-4。
我还建议您阅读this question的答案,了解有关如何加入日期范围的更多背景信息。
答案 1 :(得分:0)
考虑按日期运行merge
两次,然后subset
。下面将呼叫嵌在一个长的单行中,但可以在单独的行中分开。此外,由于您发布的数据是样本行,因此输出小于您想要的结果。
# MERGE TWICE AND SUBSET BY DATE
finaldf <- subset(merge(merge(SalesData, EmployeeSales, by="sale_id"),
EmployeeMap, "empl_num", suffixes=c('', '_')),
sale_date >= start_date & sale_date <= end_date)
# SELECT NEEDED COLUMNS
finaldf <- finaldf[c("sale_id", "sale_amt", "int_rate", "sale_date",
"sale_status", "empl_num", "team", "skill_lvl")]
# RE-ORDER BY SALE_ID AND RESET ROW NAMES
finaldf <- with(finaldf, finaldf[order(sale_id),])
row.names(finaldf) <- NULL
finaldf
# sale_id sale_amt int_rate sale_date sale_status empl_num team skill_lvl
# 1 1 7000 10.71 2008-05-01 Fully Paid 4 Red Beg
# 2 2 10800 13.57 2009-11-01 Fully Paid 4 Red Int
# 3 4 3000 14.26 2009-09-01 Fully Paid 1 Red Int
# 4 5 5600 14.96 2010-02-01 Charged Off 1 Red Int
答案 2 :(得分:0)
在SQL术语中,这是一个3向连接。它可以在单个SQL选择中完成,如下所示:
library(sqldf)
sqldf("
select s.*, es.empl_num, em.team, em.skill_lvl
from SalesData s
left join EmployeeSales es
using (sale_id)
left join EmployeeMap em
on es.empl_num = em.empl_num and s.sale_date between em.start_date and em.end_date
")
最后使用Note中的数据(基于所讨论的数据),我们得到以下结果。问题中显示的EmployeeMap数据中只存在前4个员工编号,左连接确保我们获得团队的NA值和其他人的技能级别,而不是由于不匹配而丢弃的SalesData行。
sale_id sale_amt int_rate sale_date sale_status empl_num team skill_lvl
1 1 7000 10.71 2008-05-01 Fully Paid 4 Red Beg
2 2 10800 13.57 2009-11-01 Fully Paid 4 Red Int
3 3 7500 10.08 2008-04-01 Fully Paid 9 <NA> <NA>
4 4 3000 14.26 2009-09-01 Fully Paid 1 Red Int
5 5 5600 14.96 2010-02-01 Charged Off 1 Red Int
6 6 2800 11.49 2010-08-01 Fully Paid 6 <NA> <NA>
7 7 10000 8.59 2009-10-01 Fully Paid 9 <NA> <NA>
8 8 18000 10.39 2008-03-01 Fully Paid 9 <NA> <NA>
9 9 5000 15.13 2008-04-01 Fully Paid 7 <NA> <NA>
10 10 9600 12.29 2008-03-01 Fully Paid 9 <NA> <NA>
以可重复的形式输入数据:
SalesData <- structure(list(sale_id = 1:10, sale_amt = c(7000L, 10800L, 7500L,
3000L, 5600L, 2800L, 10000L, 18000L, 5000L, 9600L), int_rate = c(10.71,
13.57, 10.08, 14.26, 14.96, 11.49, 8.59, 10.39, 15.13, 12.29),
sale_date = structure(c(3L, 6L, 2L, 4L, 7L, 8L, 5L, 1L, 2L,
1L), .Label = c("2008-03-01", "2008-04-01", "2008-05-01",
"2009-09-01", "2009-10-01", "2009-11-01", "2010-02-01", "2010-08-01"
), class = "factor"), sale_status = structure(c(2L, 2L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("Charged Off", "Fully Paid"
), class = "factor")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
EmployeeSales <-
structure(list(sale_id = 1:10, empl_name = structure(c(2L, 2L,
5L, 1L, 1L, 3L, 5L, 5L, 4L, 5L), .Label = c("Adel", "Dakota",
"Farah", "Ida", "Kami"), class = "factor"), empl_num = c(4L,
4L, 9L, 1L, 1L, 6L, 9L, 9L, 7L, 9L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
EmployeeMap <- structure(list(empl_num = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L,
4L), empl_name = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,
4L, 4L), .Label = c("Adel", "Bailey", "Casey", "Dakota"), class = "factor"),
skill_lvl = structure(c(2L, 3L, 1L, 2L, 2L, 2L, 3L, 2L, 3L,
1L), .Label = c("Adv", "Beg", "Int"), class = "factor"),
team = structure(c(2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L
), .Label = c("Blue", "Red"), class = "factor"), start_date = structure(c(1L,
2L, 6L, 4L, 8L, 4L, 7L, 1L, 3L, 5L), .Label = c("2007-06-01",
"2008-06-01", "2009-09-01", "2010-08-01", "2010-09-01", "2010-11-01",
"2011-01-01", "2011-05-01"), class = "factor"), end_date = structure(c(1L,
4L, 8L, 6L, 8L, 5L, 8L, 2L, 3L, 7L), .Label = c("2008-05-31",
"2009-08-30", "2010-08-30", "2010-10-31", "2010-12-31", "2011-04-30",
"2011-08-30", "2999-12-12"), class = "factor")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))