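Grouped operations in R

Time: 2018-06-27 14:14:33

Tags: r

I have a table sf with two fields, Customer id and Buy_date. Buy_date is unique within each customer, but each customer can have three or more different Buy_date values. For each customer, I want to compute the difference between consecutive Buy_dates and then the mean of those differences. How can I do this?

Example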

Customer   Buy_date
1          2018/03/01
1          2018/03/19
1          2018/04/3
1          2018/05/10
2          2018/01/02
2          2018/02/10
2          2018/04/13

I would like the result in this format, one row per customer:

Customer  mean

2 answers:

Answer 0: (score: 0)

Here is a dplyr solution.

Your data:

df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))

Group by customer, compute the mean Buy_date, and summarise:

library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()

Output:

# A tibble: 2 x 2
# Groups:   Customer [?]
  Customer mean               
     <dbl> <dttm>             
1        1 2018-03-31 06:30:00
2        2 2018-02-17 15:40:00

Or, as @r2evans pointed out in the comments, the mean number of days between consecutive Buy_dates:

df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()

Output:

# A tibble: 2 x 2
# Groups:   Customer [?]
  Customer mean            
     <dbl> <time>          
1        1 23.3194444444444
2        2 50.4791666666667
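
Note that the values above are slightly below 70/3 ≈ 23.33 and 101/2 = 50.5. A likely explanation (an assumption, since the answerer's time zone is not stated) is that as.POSIXct() measures elapsed time, so a daylight saving switch inside the date range shortens one gap by an hour. The base R sketch below compares this with as.Date(), which counts calendar days. The time zone America/New_York is chosen purely for illustration, and mean_gap_days is a hypothetical helper name.

```r
df <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
               "2018/01/02", "2018/02/10", "2018/04/13"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean gap, in days, between consecutive elements of x.
mean_gap_days <- function(x) mean(as.numeric(diff(x), units = "days"))

# POSIXct counts elapsed time, so the spring DST switch removes one hour.
# The time zone is an assumption made only for this illustration.
posix <- as.POSIXct(df$Buy_date, format = "%Y/%m/%d", tz = "America/New_York")
mean_posix <- sapply(split(posix, df$Customer), mean_gap_days)

# Date counts whole calendar days and is unaffected by clock changes.
dates <- as.Date(df$Buy_date, format = "%Y/%m/%d")
mean_date <- sapply(split(dates, df$Customer), mean_gap_days)

# One row per customer, in the Customer / mean layout the question asks for.
result <- data.frame(Customer = as.numeric(names(mean_date)),
                     mean = unname(mean_date))
print(result)
```

If whole days are what you are after, converting with as.Date() before calling diff() avoids the one-hour offset.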

Answer 1: (score: 0)

I am not sure about the desired output, but I guess this is what you want.

library(dplyr)
library(zoo)
dat <- read.table(text = 
"Customer   Buy_date
1          2018/03/01
1          2018/03/19
1          2018/04/3
1          2018/05/10
2          2018/01/02
2          2018/02/10
2          2018/04/13", header = TRUE, stringsAsFactors = FALSE)

dat$Buy_date <- as.Date(dat$Buy_date)

# Per customer: days since the previous purchase, and the mean of those gaps
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)), 
                                      mean_days = mean(diff_between, na.rm = TRUE))

This produces:

  Customer Buy_date   diff_between mean_days
     <int> <date>            <dbl>     <dbl>
1        1 2018-03-01           NA      23.3
2        1 2018-03-19           18      23.3
3        1 2018-04-03           15      23.3
4        1 2018-05-10           37      23.3
5        2 2018-01-02           NA      50.5
6        2 2018-02-10           39      50.5
7        2 2018-04-13           62      50.5

Edit based on a user comment:

Since you said the column is a factor rather than character, convert it to character before calling as.Date().
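
For the factor case mentioned in the edit note, a minimal sketch of the conversion might look like this. The buy_date vector is a made-up example, not the asker's data, and going through as.character() is just an explicit way of making sure the labels (rather than the underlying integer codes) are parsed.

```r
# Hypothetical factor column, as read.table() would create it with
# stringsAsFactors = TRUE.
buy_date <- factor(c("2018/03/01", "2018/03/19", "2018/04/3"))

# Convert the labels to character first, then parse them as dates.
buy_date <- as.Date(as.character(buy_date), format = "%Y/%m/%d")

# Gaps in days between consecutive dates.
gaps <- as.numeric(diff(buy_date), units = "days")
print(gaps)
```

After this step the column behaves like any other Date vector, so the group_by()/mutate() call can be used unchanged.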