How to rank columns in an R data frame based on another column

Time: 2017-11-18 13:59:12

Tags: r dataframe rank

Suppose I have an R data frame like this:

#sample data frame
df <- data.frame(
customer_id = c(568468,568468,568468,568468,568468,568468),
customer = c('paramount','paramount','paramount','paramount','paramount','paramount'),
start_date = as.Date(c('2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15')),
occured_on = as.POSIXct(c('2017-08-08 20:05:00','2017-08-08 20:30:00','2017-08-11 21:13:00','2017-08-11 21:30:00','2017-08-31 05:16:00','2017-08-31 05:30:00')),
old_plan = c('a',NA,'b',NA,'b',NA),
old_price = c(NA,29,NA,99,NA,82.5),
old_recurrence = c('monthly',NA,'monthly',NA,'annually',NA),
new_plan = c('b',NA,'b',NA,'c',NA),
new_price = c(NA,99,NA,82.5,NA,349),
new_recurrence = c('monthly',NA,'annually',NA,'monthly',NA)
);

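Task:

Within each group, take the first old_plan, old_price, and old_recurrence based on the min occured_on time, and the last new_plan, new_price, and new_recurrence based on the max occured_on time, so that my resulting data frame has the first old plan, price, and recurrence alongside the last new plan, price, and recurrence. NAs should be dropped / not considered. The resulting data frame should look like this: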

customer_id  customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
568468 paramount 2016-03-15        a        29        monthly        c       349        monthly

Or, if you would rather see it as code:

result_df <- data.frame(
customer_id = 568468,
customer = 'paramount',
start_date = "2016-03-15",
old_plan = 'a',
old_price = 29,
old_recurrence = 'monthly',
new_plan = 'c',
new_price = 349,
new_recurrence = 'monthly'
)

I feel like I'm close with these functions...

df$old_plan_rank <- rank(df$old_plan, na.last = "keep", ties.method = "min")
df$new_recurrence_rank <- rank(df$new_recurrence, na.last = "keep", ties.method = "max")

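Except that these rank by alphabetical/numeric order of the values themselves, not by the order in which things actually happened according to the occured_on column. I don't know how to specify the column to rank by.

Help?

1 answer:

Answer 0: (score: 1)

A solution using dplyr.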

library(dplyr)

df2 <- df %>%
  arrange(customer_id, start_date, occured_on) %>%
  group_by(customer_id, customer, start_date) %>%
  summarise(old_plan = first(old_plan[!is.na(old_plan)]),
            old_price = first(old_price[!is.na(old_price)]),
            old_recurrence = first(old_recurrence[!is.na(old_recurrence)]),
            new_plan = last(new_plan[!is.na(new_plan)]),
            new_price = last(new_price[!is.na(new_price)]),
            new_recurrence = last(new_recurrence[!is.na(new_recurrence)])) %>%
  ungroup() %>%
  as.data.frame()
df2
#   customer_id  customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
# 1      568468 paramount 2016-03-15        a        29        monthly        c       349        monthly

Explanation

arrange(customer_id, start_date, occured_on) sorts the rows. It sorts by customer_id, then start_date, and finally occured_on.
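To see why the sort matters, here is a minimal sketch on a hypothetical toy data frame (the names toy, id, and plan are made up for illustration, and tz = 'UTC' is set only to keep the example deterministic). The rows start out of chronological order, and arrange() puts them back in time order within each id:

```r
library(dplyr)

# Hypothetical toy data, deliberately out of chronological order
toy <- data.frame(
  id = c(1, 1, 1),
  occured_on = as.POSIXct(c('2017-08-31 05:16:00',
                            '2017-08-08 20:05:00',
                            '2017-08-11 21:13:00'), tz = 'UTC'),
  plan = c('c', 'a', 'b'),
  stringsAsFactors = FALSE
)

# Sort by id first, then by time within each id
sorted <- toy %>% arrange(id, occured_on)
sorted$plan
# [1] "a" "b" "c"
```

Once the rows are in time order, "first" and "last" within a group mean "earliest" and "latest".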

group_by(customer_id, customer, start_date) means that the operations that follow are performed within each group defined by customer_id, customer, and start_date.

summarise produces a single summary value per group for each variable.

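For each variable, taking old_plan as an example, I use old_plan[!is.na(old_plan)] to extract the non-NA values of that column. After that, first and last extract the first or last element of those values, which, because the rows are already sorted by occured_on, correspond to the earliest and latest times.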
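The same idea can be tried on a single vector. This is a small sketch using the old_price values from the question, assuming they are already in occured_on order; the variable names non_na, first_val, and last_val are only for illustration:

```r
library(dplyr)

# old_price values from the question, already in time order
old_price <- c(NA, 29, NA, 99, NA, 82.5)

# Drop the NAs, keeping the remaining values in their original order
non_na <- old_price[!is.na(old_price)]

first_val <- first(non_na)  # earliest non-NA value: 29
last_val  <- last(non_na)   # latest non-NA value: 82.5
```

Inside summarise() the same expressions are evaluated once per group, which is why each group collapses to a single row.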

ungroup() removes the grouping. as.data.frame() is optional; it converts the tibble into a plain data.frame object.
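A quick way to check what these last two steps do, sketched on a hypothetical toy data frame (toy, g, h, and v are made-up names). With two grouping variables, the result of summarise() is typically still grouped by the first one, and dplyr may print a message about that:

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy <- data.frame(
  g = c('x', 'x', 'y'),
  h = c('p', 'q', 'p'),
  v = c(1, 2, 3),
  stringsAsFactors = FALSE
)

summ <- toy %>%
  group_by(g, h) %>%
  summarise(total = sum(v))

is_grouped_df(summ)    # TRUE: still grouped by g

flat <- ungroup(summ)
is_grouped_df(flat)    # FALSE: grouping removed

plain <- as.data.frame(flat)
class(plain)           # "data.frame"
```

Dropping the grouping avoids surprises if more dplyr verbs are applied to the result later.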