I have a dataset that looks like this:
id Type Sale SaleDate Time Cat LoadType LoadDate
A11 ABC 123 15/11/2016 00:00 AAA Unload 23/11/2016
A11 ABC 123 15/11/2016 00:00 AAA Load 17/11/2016
A556 ABC 444 09/01/2017 00:00 VVV Unload 17/01/2017
A556 ABC 444 09/01/2017 00:00 VVV Load 17/01/2017
For each id, I want the difference in days between its LoadDate values. For example, it should return
id .... LoadDate DifferenceInDays
A11 .... 23/11/2016 6
A11 .... 17/11/2016 6
For two rows with the same id, DifferenceInDays should be identical.
Answer 0: (score: 2)
You can group by id and then compute max(LoadDate) - min(LoadDate). Assuming your data frame is named myData:
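library(dplyr)

myData %>%
  # parse the dd/mm/yyyy strings into Date objects
  mutate(SaleDate = as.Date(SaleDate, "%d/%m/%Y"),
         LoadDate = as.Date(LoadDate, "%d/%m/%Y")) %>%
  group_by(id) %>%
  # one row per id: span between the latest and earliest LoadDate
  summarise(DifferenceInDays = max(LoadDate) - min(LoadDate))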
Result:
id DifferenceInDays
<chr> <time>
1 A11 6 days
2 A556 0 days
If you want to add the column to the original data frame instead, use mutate() in place of summarise().
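As a minimal sketch of the mutate() variant, the snippet below rebuilds a cut-down copy of the question's data (only id, LoadType and LoadDate; the name result is just illustrative) and keeps all four rows. Wrapping the difference in as.numeric(..., units = "days") is an optional extra, not part of the original answer; it stores a plain number rather than a difftime.

```r
library(dplyr)

# Cut-down copy of the example data from the question (assumed column subset)
myData <- data.frame(
  id       = c("A11", "A11", "A556", "A556"),
  LoadType = c("Unload", "Load", "Unload", "Load"),
  LoadDate = c("23/11/2016", "17/11/2016", "17/01/2017", "17/01/2017"),
  stringsAsFactors = FALSE
)

result <- myData %>%
  # parse the dd/mm/yyyy strings into Date objects
  mutate(LoadDate = as.Date(LoadDate, "%d/%m/%Y")) %>%
  group_by(id) %>%
  # mutate() keeps every row; the per-id value is repeated within each group
  mutate(DifferenceInDays = as.numeric(max(LoadDate) - min(LoadDate),
                                       units = "days")) %>%
  ungroup()

print(result)
```

With this data, both A11 rows get 6 and both A556 rows get 0, which matches the layout requested in the question.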
Answer 1: (score: 1)
I would use data.table:
library(data.table)
# Your example data, in a data.frame
df = read.table(text='id Type Sale SaleDate Time Cat LoadType LoadDate
A11 ABC 123 15/11/2016 00:00 AAA Unload 23/11/2016
A11 ABC 123 15/11/2016 00:00 AAA Load 17/11/2016
A556 ABC 444 09/01/2017 00:00 VVV Unload 17/01/2017
A556 ABC 444 09/01/2017 00:00 VVV Load 17/01/2017', header=TRUE)
# convert to a data.table...
dt = data.table(df, key='id')
# ... with the right format for the date
dt[, LoadDate := as.IDate(LoadDate, format='%d/%m/%Y')]
# computes the difference in days, by ID:
dt[, DifferenceInDays := diff(range(LoadDate)), by=id]
This gives the desired output:
> dt
id Type Sale SaleDate Time Cat LoadType LoadDate DifferenceInDays
1: A11 ABC 123 15/11/2016 00:00 AAA Unload 2016-11-23 6
2: A11 ABC 123 15/11/2016 00:00 AAA Load 2016-11-17 6
3: A556 ABC 444 09/01/2017 00:00 VVV Unload 2017-01-17 0
4: A556 ABC 444 09/01/2017 00:00 VVV Load 2017-01-17 0