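I have a df similar to df1, in which I want to break up rows so that the Hrs_Time_Worked column is split into intervals of 4, as shown in df2.
I have been using the following code, but it throws an error: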
df2 = df1 %>%
group_by(Row)%>%
mutate(S=START_DATE_TIME,
Hrs_Time_Worked=list((n<-c(rep(4,Hrs_Time_Worked%/%4),Hrs_Time_Worked%%4))[n!=0]))%>%
unnest()%>%
mutate(E=START_DATE_TIME+hours(cumsum(Hrs_Time_Worked)),
S=E-hours(unlist(Hrs_Time_Worked)),
START_DATE_TIME=(S),
END_DATE_TIME=(E),
S=NULL,E=NULL)
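Error in mutate_impl(.data, dots) : Evaluation error: invalid class "Period" object: periods must have integer values.
The requirements are:
All categorical data must stay the same across sub-rows (e.g., TIME_RPTG_CD stays constant on each sub-row)
If there is a remainder of less than four, the remaining amount should be listed on the last line (e.g., df2; row 3)
If a sub-row starts or ends on the next date, the date columns should be updated accordingly (e.g., df2; rows 2-3)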
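The chunking rule described above can be sketched in base R without lubridate periods. This is only an illustration under assumptions: `split_shift` and its arguments are hypothetical names, not part of the original post, and the offsets are computed in plain seconds so that fractional hours do not need to be integer-valued.

```r
# Hypothetical helper (not from the original post): split one shift into
# chunks of at most `chunk` hours, using seconds arithmetic on POSIXct.
split_shift <- function(start, hours_worked, chunk = 4) {
  # Full chunks followed by the remainder; drop a zero remainder.
  pieces <- c(rep(chunk, hours_worked %/% chunk), hours_worked %% chunk)
  pieces <- pieces[pieces != 0]
  # End of each chunk is the start plus the cumulative hours, in seconds.
  ends <- start + cumsum(pieces) * 3600
  starts <- ends - pieces * 3600
  data.frame(START_DATE_TIME = starts,
             END_DATE_TIME = ends,
             Hrs_Time_Worked = pieces)
}

shift_start <- as.POSIXct("2014-07-03 16:00:00", tz = "UTC")
res <- split_shift(shift_start, 10)
print(res)
```

Applied per row, the categorical columns can then be repeated alongside the result; the date part rolls over on its own because the arithmetic is done on full date-times.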
df1 (current)
Row EMPLID TIME_RPTG_CD START_DATE_TIME END_DATE_TIME Hrs_Time_Worked
<chr> <chr> <dttm> <dttm> <dbl>
1 X00007 REG 2014-07-03 16:00:00 2014-07-04 02:00:00 10.0
df2 (desired)
Row EMPLID TIME_RPTG_CD START_DATE_TIME END_DATE_TIME Hrs_Time_Worked
<chr> <chr> <dttm> <dttm> <dbl>
1 X00007 REG 2014-07-03 16:00:00 2014-07-03 20:00:00 4.0
2 X00007 REG 2014-07-03 20:00:00 2014-07-04 00:00:00 4.0
3 X00007 REG 2014-07-04 00:00:00 2014-07-04 02:00:00 2.0
Answer 0: (score: 1)
One approach could be
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
rowwise() %>%
mutate(START_DATE_TIME = paste(seq.POSIXt(START_DATE_TIME, END_DATE_TIME, by = "4 hour"), collapse = ",")) %>%
separate_rows(START_DATE_TIME, sep = ",") %>%
group_by(Row) %>%
mutate(END_DATE_TIME = ymd_hms(lead(START_DATE_TIME, order_by = Row, default = as.character(END_DATE_TIME))),
START_DATE_TIME = ymd_hms(START_DATE_TIME),
Hrs_Time_Worked = as.numeric(difftime(END_DATE_TIME, START_DATE_TIME, units = "hour"))) %>%
filter(Hrs_Time_Worked > 0)
which gives
Row EMPLID TIME_RPTG_CD START_DATE_TIME END_DATE_TIME Hrs_Time_Worked
1 1 X00007 REG 2014-07-03 16:00:00 2014-07-03 20:00:00 4.00
2 1 X00007 REG 2014-07-03 20:00:00 2014-07-04 00:00:00 4.00
3 1 X00007 REG 2014-07-04 00:00:00 2014-07-04 02:00:00 2.00
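As a side note on why the final filter(Hrs_Time_Worked > 0) step matters: seq.POSIXt includes the end point when the shift length is an exact multiple of the step, so the last generated start equals the shift end and yields a zero-hour row. A minimal base R sketch, assuming a hypothetical 8-hour shift (not data from the original post):

```r
# Hypothetical 8-hour shift, an exact multiple of the 4-hour step.
s <- as.POSIXct("2014-07-03 16:00:00", tz = "UTC")
e <- s + 8 * 3600

# The sequence of starts includes the end point itself.
starts <- seq(s, e, by = "4 hour")
# Each end is the next start; the last end is the shift end.
ends <- c(starts[-1], e)
hrs <- as.numeric(difftime(ends, starts, units = "hours"))

# The trailing zero-length piece is what the filter removes.
keep <- hrs > 0
print(data.frame(starts, ends, hrs)[keep, ])
```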
Sample data:
df <- structure(list(Row = 1L, EMPLID = "X00007", TIME_RPTG_CD = "REG",
START_DATE_TIME = structure(1404403200, tzone = "UTC", class = c("POSIXct",
"POSIXt")), END_DATE_TIME = structure(1404439200, tzone = "UTC", class = c("POSIXct",
"POSIXt")), Hrs_Time_Worked = 10), .Names = c("Row", "EMPLID",
"TIME_RPTG_CD", "START_DATE_TIME", "END_DATE_TIME", "Hrs_Time_Worked"
), row.names = c(NA, -1L), class = "data.frame")
# Row EMPLID TIME_RPTG_CD START_DATE_TIME END_DATE_TIME Hrs_Time_Worked
#1 1 X00007 REG 2014-07-03 16:00:00 2014-07-04 02:00:00 10
Answer 1: (score: 0)
Similar to @Prem's answer, but using a list column and unnest:
df %>%
rowwise %>%
mutate(START_DATE_TIME = list(seq.POSIXt(START_DATE_TIME, END_DATE_TIME, by = "4 hour")),
END_DATE_TIME = list(c(tail(START_DATE_TIME,-1),END_DATE_TIME))) %>%
unnest %>%
mutate(Hrs_Time_Worked = difftime(END_DATE_TIME, START_DATE_TIME, units = "hours"))
# # A tibble: 3 x 6
# Row EMPLID TIME_RPTG_CD Hrs_Time_Worked START_DATE_TIME END_DATE_TIME
# <int> <chr> <chr> <time> <dttm> <dttm>
# 1 1 X00007 REG 4 2014-07-03 16:00:00 2014-07-03 20:00:00
# 2 1 X00007 REG 4 2014-07-03 20:00:00 2014-07-04 00:00:00
# 3 1 X00007 REG 2 2014-07-04 00:00:00 2014-07-04 02:00:00
Using map is more efficient than using rowwise. Although I find it less readable, it can be done with purrr's map functions:
library(purrr)  # map2() pairs each start with its own end
df %>%
mutate(START_DATE_TIME = map2(START_DATE_TIME, END_DATE_TIME, ~seq.POSIXt(.x, .y, by = "4 hour")),
END_DATE_TIME = map2(END_DATE_TIME, START_DATE_TIME, ~c(tail(.y, -1), .x))) %>%
unnest %>%
mutate(Hrs_Time_Worked = difftime(END_DATE_TIME, START_DATE_TIME, units = "hours"))
# Row EMPLID TIME_RPTG_CD Hrs_Time_Worked START_DATE_TIME END_DATE_TIME
# 1 1 X00007 REG 4 hours 2014-07-03 16:00:00 2014-07-03 20:00:00
# 2 1 X00007 REG 4 hours 2014-07-03 20:00:00 2014-07-04 00:00:00
# 3 1 X00007 REG 2 hours 2014-07-04 00:00:00 2014-07-04 02:00:00
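In this case the output is not a tibble but a standard data.frame, which explains why the Hrs_Time_Worked column prints differently. Use as_tibble to get the same output, or use as.numeric in either solution to turn the column into a double.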
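A small base R sketch of the as.numeric conversion mentioned above; the two timestamps are taken from the last row of the example output, and the variable names are only illustrative. Note that the third positional argument of difftime is tz, so units is passed by name here.

```r
start <- as.POSIXct("2014-07-04 00:00:00", tz = "UTC")
end <- as.POSIXct("2014-07-04 02:00:00", tz = "UTC")

# A difftime object carries its units with it.
d <- difftime(end, start, units = "hours")

# as.numeric() drops the class and returns a plain double, in the requested units.
hrs <- as.numeric(d, units = "hours")
print(hrs)
```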