Below is a sample of my data:
Customer_id Seller_id transaction_date transaction_status
1 7 2010-03-10 SUCCESS
1 7 2010-03-10 FAIL
1 7 2010-03-10 SUCCESS
1 8 2010-03-10 SUCCESS
2 9 2010-03-15 FAIL
2 9 2010-03-15 FAIL
3 10 2010-03-15 SUCCESS
3 10 2010-03-15 FAIL
3 10 2010-03-15 SUCCESS
4 10 2010-03-15 FAIL
4 10 2010-03-15 FAIL
4 11 2010-03-15 SUCCESS
5 12 2010-03-20 FAIL
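For each customer, seller and date, I want the number of transactions that occur after the first failed transaction.
I am quite comfortable with R, but I have never written a loop in it. Any help would be great.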
Answer 0: (score: 2)
Sample data
df = read.table(text = "
Customer_id Seller_id transaction_date transaction_status
1 7 2010-03-10 SUCCESS
1 7 2010-03-10 FAIL
1 7 2010-03-10 SUCCESS
1 8 2010-03-10 SUCCESS
2 9 2010-03-15 FAIL
2 9 2010-03-15 FAIL
3 10 2010-03-15 SUCCESS
3 10 2010-03-15 FAIL
3 10 2010-03-15 SUCCESS
4 10 2010-03-15 FAIL
4 10 2010-03-15 FAIL
4 11 2010-03-15 SUCCESS
5 12 2010-03-20 FAIL
", header=T)
Solution (using dplyr instead of a for loop)
library(dplyr)
df %>%
  group_by(Customer_id, Seller_id, transaction_date) %>%
  # Count every SUCCESS after a FAIL, plus every FAIL after the first one
  mutate(transaction_count = sum((transaction_status == "SUCCESS" & cumsum(transaction_status == "FAIL") > 0) |
                                 (transaction_status == "FAIL" & cumsum(transaction_status == "FAIL") > 1))) %>%
  ungroup()
# # A tibble: 13 x 5
# Customer_id Seller_id transaction_date transaction_status transaction_count
# <int> <int> <fct> <fct> <int>
# 1 1 7 2010-03-10 SUCCESS 1
# 2 1 7 2010-03-10 FAIL 1
# 3 1 7 2010-03-10 SUCCESS 1
# 4 1 8 2010-03-10 SUCCESS 0
# 5 2 9 2010-03-15 FAIL 1
# 6 2 9 2010-03-15 FAIL 1
# 7 3 10 2010-03-15 SUCCESS 1
# 8 3 10 2010-03-15 FAIL 1
# 9 3 10 2010-03-15 SUCCESS 1
#10 4 10 2010-03-15 FAIL 1
#11 4 10 2010-03-15 FAIL 1
#12 4 11 2010-03-15 SUCCESS 0
#13 5 12 2010-03-20 FAIL 0
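Note that the solution above keeps the same number of rows as the original dataset.
If you append ... %>% distinct(Customer_id, Seller_id, transaction_date, transaction_count) to the end of the previous code, you get the distinct combinations, for example: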
# # A tibble: 7 x 4
# Customer_id Seller_id transaction_date transaction_count
# <int> <int> <fct> <int>
# 1 1 7 2010-03-10 1
# 2 1 8 2010-03-10 0
# 3 2 9 2010-03-15 1
# 4 3 10 2010-03-15 1
# 5 4 10 2010-03-15 1
# 6 4 11 2010-03-15 0
# 7 5 12 2010-03-20 0
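The condition inside mutate() amounts to counting the rows that come after the first FAIL in a group. A minimal base-R sketch on one group may make this easier to follow; the status vector below is a hypothetical example and is not part of the sample data.

```r
# Hypothetical status vector for a single customer/seller/date group
status <- c("SUCCESS", "FAIL", "SUCCESS", "FAIL", "SUCCESS")

# Running count of failures seen so far, including the current row
fails_so_far <- cumsum(status == "FAIL")

# A row lies after the first failure if it is a SUCCESS with at least one
# earlier FAIL, or a FAIL that is not the first one
after_first_fail <- (status == "SUCCESS" & fails_so_far > 0) |
  (status == "FAIL" & fails_so_far > 1)

# Number of transactions after the first failure in this group
transaction_count <- sum(after_first_fail)
print(transaction_count)
```

Here the first FAIL is in position 2, so the three rows that follow it are counted.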
Answer 1: (score: 2)
A solution using dplyr and tidyr:
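library(dplyr); library(tidyr)
df %>% group_by(Customer_id, Seller_id, transaction_date) %>%
  # Mark rows directly after a FAIL as TRUE (NA otherwise) so fill() carries the TRUE down
  mutate(postfail = ifelse(lag(transaction_status == "FAIL"), TRUE, NA)) %>%
  fill(postfail) %>%
  summarize(postfail_transactions = sum(postfail, na.rm = TRUE))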
Result:
# A tibble: 7 x 3
# Groups: Customer_id [?]
Customer_id Seller_id postfail_transactions
<int> <int> <int>
1 1 7 1
2 1 8 0
3 2 9 1
4 3 10 1
5 4 10 1
6 4 11 0
7 5 12 0
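Note that I count every transaction after the first failure, including further failures.
If you would rather have the information as another column in the same data frame, remove the last step (the summarize call).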
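For readers who want the count as a per-row column without extra packages, the following is a rough base-R sketch using ave(). It is an alternative technique, not the method used in this answer, and the column name postfail_transactions is simply reused for illustration.

```r
# Sample data from the question
df <- read.table(text = "
Customer_id Seller_id transaction_date transaction_status
1 7 2010-03-10 SUCCESS
1 7 2010-03-10 FAIL
1 7 2010-03-10 SUCCESS
1 8 2010-03-10 SUCCESS
2 9 2010-03-15 FAIL
2 9 2010-03-15 FAIL
3 10 2010-03-15 SUCCESS
3 10 2010-03-15 FAIL
3 10 2010-03-15 SUCCESS
4 10 2010-03-15 FAIL
4 10 2010-03-15 FAIL
4 11 2010-03-15 SUCCESS
5 12 2010-03-20 FAIL
", header = TRUE)

# 1 for a failed transaction, 0 otherwise
is_fail <- as.integer(df$transaction_status == "FAIL")

# ave() applies FUN within each group and returns a vector aligned with the rows
df$postfail_transactions <- ave(
  is_fail, df$Customer_id, df$Seller_id, df$transaction_date,
  FUN = function(f) {
    seen <- cumsum(f) > 0                  # TRUE from the first failure onward
    rep(sum(seen) - any(seen), length(f))  # rows strictly after the first failure
  }
)

print(df)
```

Subtracting any(seen) removes the first failed row itself from the count, and leaves the count at 0 for groups with no failure.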
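Answer 2: (score: 2)
Here count_success is the count of all successes after the first failure, and count_all is the count of all transactions after the first failure.
.N is the number of rows in the current group, and which.max(fail) gives the first position where fail is TRUE, so .N - which.max(fail) gives the number of rows (transactions) after the first failure.
The number of successes is computed as the sum of a logical vector that is TRUE when the row number seq(.N) is greater than the row number of the first failure, which.max(fail), and fail is FALSE, i.e. !fail.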
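library(data.table)
setDT(df)
# Number of transactions, and of successful transactions, after the first failure
df[, {fail <- transaction_status == 'FAIL'
      # which.max(fail) returns 1 when a group has no failure, so fall back to .N
      first_fail <- if (any(fail)) which.max(fail) else .N
      .(count_all = .N - first_fail,
        count_success = sum(!fail & (seq_len(.N) > first_fail)))}
   , by = .(Customer_id, Seller_id, transaction_date)]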
# Customer_id Seller_id transaction_date count_all count_success
# 1: 1 7 2010-03-10 1 1
# 2: 1 8 2010-03-10 0 0
# 3: 2 9 2010-03-15 1 0
# 4: 3 10 2010-03-15 1 1
# 5: 4 10 2010-03-15 1 0
# 6: 4 11 2010-03-15 0 0
# 7: 5 12 2010-03-20 0 0
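One detail of which.max() matters for groups that contain no failure: on an all-FALSE vector it returns 1, the position of the first maximum. The base-R sketch below illustrates this on two hypothetical vectors; the helper name rows_after_first_fail is made up for this example.

```r
# Hypothetical failure indicators for two groups
fail_some <- c(FALSE, TRUE, FALSE, TRUE)   # first failure at position 2
fail_none <- c(FALSE, FALSE, FALSE)        # no failure at all

which.max(fail_some)  # 2
which.max(fail_none)  # 1, because every element ties for the maximum

# Hypothetical helper: rows after the first failure, 0 if there is none
rows_after_first_fail <- function(fail) {
  first_fail <- if (any(fail)) which.max(fail) else length(fail)
  length(fail) - first_fail
}

print(rows_after_first_fail(fail_some))
print(rows_after_first_fail(fail_none))
```

Without the any(fail) check, the second group would report two transactions after a failure that never happened.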