Count all observations after the first failure within a day

Time: 2018-10-16 12:40:18

Tags: r

Below is a sample of my data:

Customer_id  Seller_id   transaction_date   transaction_status
    1          7           2010-03-10         SUCCESS
    1          7           2010-03-10         FAIL
    1          7           2010-03-10         SUCCESS
    1          8           2010-03-10         SUCCESS
    2          9           2010-03-15         FAIL
    2          9           2010-03-15         FAIL
    3          10          2010-03-15         SUCCESS
    3          10          2010-03-15         FAIL
    3          10          2010-03-15         SUCCESS
    4          10          2010-03-15         FAIL
    4          10          2010-03-15         FAIL
    4          11          2010-03-15         SUCCESS
    5          12          2010-03-20         FAIL

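I am looking for the number of transactions between a customer and a seller that occur after the first failed transaction on a given date.

I am quite familiar with R, but I have never written any loops in it. Any help would be great.

3 answers:

Answer 0: (score: 2)

Sample data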

df = read.table(text = "
Customer_id  Seller_id   transaction_date   transaction_status
1          7           2010-03-10         SUCCESS
1          7           2010-03-10         FAIL
1          7           2010-03-10         SUCCESS
1          8           2010-03-10         SUCCESS
2          9           2010-03-15         FAIL
2          9           2010-03-15         FAIL
3          10          2010-03-15         SUCCESS
3          10          2010-03-15         FAIL
3          10          2010-03-15         SUCCESS
4          10          2010-03-15         FAIL
4          10          2010-03-15         FAIL
4          11          2010-03-15         SUCCESS
5          12          2010-03-20         FAIL
", header=T)

Solution (using dplyr rather than a for loop)

library(dplyr)

df %>%
  group_by(Customer_id, Seller_id, transaction_date) %>%
  mutate(transaction_count = sum((transaction_status == "SUCCESS" & cumsum(transaction_status == "FAIL") > 0) |
                                 (transaction_status == "FAIL" & cumsum(transaction_status == "FAIL") > 1))) %>%
  ungroup()

# # A tibble: 13 x 5
#   Customer_id Seller_id transaction_date transaction_status transaction_count
#         <int>     <int> <fct>            <fct>                          <int>
# 1           1         7 2010-03-10       SUCCESS                            1
# 2           1         7 2010-03-10       FAIL                               1
# 3           1         7 2010-03-10       SUCCESS                            1
# 4           1         8 2010-03-10       SUCCESS                            0
# 5           2         9 2010-03-15       FAIL                               1
# 6           2         9 2010-03-15       FAIL                               1
# 7           3        10 2010-03-15       SUCCESS                            1
# 8           3        10 2010-03-15       FAIL                               1
# 9           3        10 2010-03-15       SUCCESS                            1
#10           4        10 2010-03-15       FAIL                               1
#11           4        10 2010-03-15       FAIL                               1
#12           4        11 2010-03-15       SUCCESS                            0
#13           5        12 2010-03-20       FAIL                               0

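Note that the solution above keeps the same number of rows as the original dataset.

If you append ... %>% distinct(Customer_id, Seller_id, transaction_date, transaction_count) to the previous code, you get one row per distinct combination: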

# # A tibble: 7 x 4
#   Customer_id Seller_id transaction_date transaction_count
#         <int>     <int> <fct>                        <int>
# 1           1         7 2010-03-10                       1
# 2           1         8 2010-03-10                       0
# 3           2         9 2010-03-15                       1
# 4           3        10 2010-03-15                       1
# 5           4        10 2010-03-15                       1
# 6           4        11 2010-03-15                       0
# 7           5        12 2010-03-20                       0
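
To see why the condition inside sum() picks out the rows after the first failure, it can help to trace it on a single group. The sketch below uses a made-up status vector (not taken from the sample data) and base R only; the variable names are illustrative.

```r
# Hypothetical single group: one customer/seller/date, in transaction order
status <- c("SUCCESS", "FAIL", "SUCCESS", "FAIL", "SUCCESS")
fail <- status == "FAIL"

# Running count of failures seen so far, including the current row
seen <- cumsum(fail)

# A row comes after the first failure if it is a success with at least one
# earlier failure, or a failure that is not the first one
after_first <- (!fail & seen > 0) | (fail & seen > 1)

transaction_count <- sum(after_first)
print(data.frame(status, seen, after_first))
print(transaction_count)
```

The first FAIL itself is excluded because its running count is exactly 1, which is why the second half of the condition tests for a count greater than 1.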

Answer 1: (score: 2)

A solution using dplyr and tidyr:

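library(dplyr); library(tidyr)
df %>% group_by(Customer_id, Seller_id, transaction_date) %>%
  # keep only TRUE (NA elsewhere) so that fill() carries it down to every later row
  mutate(postfail = ifelse(lag(transaction_status == "FAIL"), TRUE, NA)) %>%
  fill(postfail) %>%
  summarize(postfail_transactions = sum(postfail, na.rm = TRUE))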

Result (transaction_date column not shown):

# A tibble: 7 x 3
# Groups:   Customer_id [?]
  Customer_id Seller_id postfail_transactions
        <int>     <int>                 <int>
1           1         7                     1
2           1         8                     0
3           2         9                     1
4           3        10                     1
5           4        10                     1
6           4        11                     0
7           5        12                     0

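Note that I count any transaction after the first failure (including further failures).

If you want the information as another column in the same data frame, remove the last step.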
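
If the per-group count should sit on every row of the original data frame, one option that needs no extra packages is base R's ave(). The sketch below rebuilds the first six rows of the sample data by hand; the column name transaction_count and the helper count_after_first are illustrative choices, not part of the answer above.

```r
# First six rows of the sample data, typed in by hand
df_small <- data.frame(
  Customer_id        = c(1, 1, 1, 1, 2, 2),
  Seller_id          = c(7, 7, 7, 8, 9, 9),
  transaction_date   = c("2010-03-10", "2010-03-10", "2010-03-10",
                         "2010-03-10", "2010-03-15", "2010-03-15"),
  transaction_status = c("SUCCESS", "FAIL", "SUCCESS",
                         "SUCCESS", "FAIL", "FAIL")
)

# Illustrative helper: f is a 0/1 failure indicator for one group.
# Counts rows at or after the first failure, minus the first failure itself.
count_after_first <- function(f) sum(cumsum(f) > 0) - any(f > 0)

# ave() applies the helper per group and writes the result back to each row
df_small$transaction_count <- ave(
  as.integer(df_small$transaction_status == "FAIL"),
  df_small$Customer_id, df_small$Seller_id, df_small$transaction_date,
  FUN = count_after_first
)

print(df_small)
```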

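Answer 2: (score: 2)

Here is the count of all successes after the first failure (count_success), and of all transactions after the first failure (count_all).

.N is the number of rows in the current group, and which.max(fail) gives the position of the first element of fail that is TRUE, so .N - which.max(fail) gives the number of rows (transactions) after the first failure.

The number of successes is computed as the sum of a logical vector that is TRUE when the row number seq(.N) is greater than the row number of the first failure, which.max(fail), and the row is not a failure (fail is FALSE, i.e. !fail).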

library(data.table)
setDT(df)

# Number of Successful transactions after first failure
df[, {fail <- transaction_status == 'FAIL'
      .(count_all      = .N - which.max(fail), 
        count_success  = sum(!fail & (seq(.N) > which.max(fail))))}
   , by = .(Customer_id, Seller_id, transaction_date)]

#    Customer_id Seller_id transaction_date count_all count_success
# 1:           1         7       2010-03-10         1             1
# 2:           1         8       2010-03-10         0             0
# 3:           2         9       2010-03-15         1             0
# 4:           3        10       2010-03-15         1             1
# 5:           4        10       2010-03-15         1             0
# 6:           4        11       2010-03-15         0             0
# 7:           5        12       2010-03-20         0             0
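
One caveat worth checking: which.max() returns 1 for a vector with no TRUE, so a group with several transactions and no failure would be counted as if its first row had failed. The sample data happens to avoid this because its failure-free groups each have a single row. A minimal guarded sketch in base R, where count_after_first_fail is a hypothetical helper name:

```r
# which.max() on an all-FALSE vector returns 1, not "no match"
print(which.max(c(FALSE, FALSE)))

# Hypothetical helper: counts for one group's statuses, in transaction order
count_after_first_fail <- function(status) {
  fail <- status == "FAIL"
  # Guard: with no failure in the group, nothing comes "after the first failure"
  if (!any(fail)) {
    return(c(count_all = 0L, count_success = 0L))
  }
  first <- which.max(fail)            # position of the first failure
  after <- seq_along(fail) > first    # rows strictly after it
  c(count_all = sum(after), count_success = sum(after & !fail))
}

print(count_after_first_fail(c("SUCCESS", "SUCCESS")))          # no failure
print(count_after_first_fail(c("SUCCESS", "FAIL", "SUCCESS")))  # one success after
```

Assuming df has been converted with setDT(df) as above, the helper should be usable per group as df[, as.list(count_after_first_fail(transaction_status)), by = .(Customer_id, Seller_id, transaction_date)].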