Suppose I have an R data frame like this:
#sample data frame
df <- data.frame(
customer_id = c(568468,568468,568468,568468,568468,568468),
customer = c('paramount','paramount','paramount','paramount','paramount','paramount'),
start_date = as.Date(c('2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15')),
occured_on = as.POSIXct(c('2017-08-08 20:05:00','2017-08-08 20:30:00','2017-08-11 21:13:00','2017-08-11 21:30:00','2017-08-31 05:16:00','2017-08-31 05:30:00')),
old_plan = c('a',NA,'b',NA,'b',NA),
old_price = c(NA,29,NA,99,NA,82.5),
old_recurrence = c('monthly',NA,'monthly',NA,'annually',NA),
new_plan = c('b',NA,'b',NA,'c',NA),
new_price = c(NA,99,NA,82.5,NA,349),
new_recurrence = c('monthly',NA,'annually',NA,'monthly',NA)
);
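Task:
Within each group, take old_plan, old_price and old_recurrence from the earliest (min) occured_on time, and new_plan, new_price and new_recurrence from the latest (max) occured_on time, so that the resulting data frame holds the first old plan, price and recurrence together with the last new plan, price and recurrence. NA values should be dropped rather than considered. The resulting data frame should look like this: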
customer_id customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
568468 paramount 2016-03-15 a 29 monthly c 349 monthly
Or, if you would rather see it as code:
result_df <- data.frame(
customer_id = 568468,
customer = 'paramount',
start_date = as.Date("2016-03-15"),
old_plan = 'a',
old_price = 29,
old_recurrence = 'monthly',
new_plan = 'c',
new_price = 349,
new_recurrence = 'monthly'
)
I feel like I'm close with these functions...
df$old_plan_rank <- rank(df$old_plan, na.last = "keep", ties.method = "min")
df$new_recurrence_rank <- rank(df$new_recurrence, na.last = "keep", ties.method = "max")
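except that they rank the values alphabetically/numerically rather than by the order in which they actually happened according to the occured_on column. I don't know how to specify which column to rank by.
Any help?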
Answer 0: (score: 1)
A solution using dplyr.
library(dplyr)
df2 <- df %>%
arrange(customer_id, start_date, occured_on) %>%
group_by(customer_id, customer, start_date) %>%
summarise(old_plan = first(old_plan[!is.na(old_plan)]),
old_price = first(old_price[!is.na(old_price)]),
old_recurrence = first(old_recurrence[!is.na(old_recurrence)]),
new_plan = last(new_plan[!is.na(new_plan)]),
new_price = last(new_price[!is.na(new_price)]),
new_recurrence = last(new_recurrence[!is.na(new_recurrence)])) %>%
ungroup() %>%
as.data.frame()
df2
# customer_id customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
# 1 568468 paramount 2016-03-15 a 29 monthly c 349 monthly
Explanation
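arrange(customer_id, start_date, occured_on) sorts the rows. It sorts by customer_id first, then by start_date, and finally by occured_on, so within each customer the rows end up in chronological order.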
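As a small illustration of the sorting step, here is a sketch on a made-up three-row data frame (toy, id and plan are hypothetical names, not part of the question's data):

```r
library(dplyr)

# Hypothetical toy data, deliberately out of chronological order
toy <- data.frame(
  id = c(1, 1, 1),
  occured_on = as.POSIXct(
    c('2017-08-11 21:13:00', '2017-08-08 20:05:00', '2017-08-31 05:16:00'),
    tz = 'UTC'
  ),
  plan = c('b', 'a', 'c'),
  stringsAsFactors = FALSE
)

# Sort by id first, then by timestamp within each id
toy_sorted <- toy %>% arrange(id, occured_on)

toy_sorted$plan
# [1] "a" "b" "c"
```

After sorting, the first row of each group is the earliest event and the last row is the latest one, which is what the rest of the pipeline relies on.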
group_by(customer_id, customer, start_date) means that the operations that follow are carried out separately within each group defined by customer_id, customer and start_date.
summarise produces a single summary value per group for each variable.
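To see how grouping and summarising work together, here is a minimal sketch on hypothetical data (toy, price and max_price are made-up names for illustration only):

```r
library(dplyr)

# Hypothetical toy data with two customers
toy <- data.frame(
  customer_id = c(1, 1, 2),
  price = c(10, 20, 5)
)

per_customer <- toy %>%
  group_by(customer_id) %>%
  summarise(max_price = max(price))  # collapses each group to one row

nrow(per_customer)
# [1] 2
per_customer$max_price
# [1] 20  5
```

Three input rows become two output rows, one per customer_id, which is the same collapsing that happens in the solution above.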
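For each variable, taking old_plan as an example, I use old_plan[!is.na(old_plan)] to extract the non-NA values of that column. Because the rows were already sorted by occured_on, first and last can then pick the first or last of those values, which correspond to the earliest and latest times respectively.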
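The NA filtering can be tried on its own with a plain vector. This is only a sketch; the vector below is hypothetical and is assumed to be already in chronological order:

```r
library(dplyr)

# Hypothetical values in chronological order, with NA gaps in between
old_plan <- c(NA, 'a', NA, 'b', NA)

# Logical indexing drops the NA entries and keeps the remaining order
non_missing <- old_plan[!is.na(old_plan)]

earliest <- first(non_missing)
latest <- last(non_missing)

earliest
# [1] "a"
latest
# [1] "b"
```

Without the filter, first(old_plan) and last(old_plan) would both return NA here, since the vector starts and ends with a missing value.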
ungroup() removes the grouping. as.data.frame() is optional; it converts the tibble into a plain data.frame.
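The effect of these last two steps can be checked on a small made-up example (toy, g, h and v are hypothetical names). With more than one grouping variable, summarise by default peels off only the last one, so the result is still grouped until ungroup() is called:

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy <- data.frame(
  g = c('x', 'x', 'y'),
  h = c('p', 'p', 'q'),
  v = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# summarise() drops only the last grouping level (h); g remains
grouped_res <- toy %>%
  group_by(g, h) %>%
  summarise(total = sum(v))

is_grouped_df(grouped_res)
# [1] TRUE

# Remove the remaining grouping and convert the tibble to a base data.frame
plain_res <- grouped_res %>%
  ungroup() %>%
  as.data.frame()

is_grouped_df(plain_res)
# [1] FALSE
class(plain_res)
# [1] "data.frame"
```

Recent dplyr versions print an informational message about the remaining grouping when summarise is called this way; it does not affect the result.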