I'm new to R and need help creating variables over a historical period.
Suppose I have the following data structure:
User_ID Tran_date Fraud_ind
A 1-Jan-15 1
A 2-Jan-15 1
A 3-Jan-15 0
A 4-Jan-13 0
A 5-Jan-10 1
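I need to create variables using a rolling window. That is, I need the fraud rate for User_ID A over the past 365 days. The answer in this case should be
(number of fraudulent transactions in the past 365 days) / (number of transactions in the past 365 days)
which is
2/3 = 66.67%
Please help me compute this in R.
Answer 0 (score: 0)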
You can use the rollmean function from zoo; just make sure your data is sorted first:
library(dplyr)
library(zoo)
TS_data <- read.csv("data.csv", stringsAsFactors = FALSE)
Roll.Mean <- TS_data %>%
  filter(User_ID == "A") %>%
  mutate(
    avg.365 = rollmean(x = Fraud_ind,
                       k = 3,
                       fill = NA)
  )
>Roll.Mean
User_ID Tran_date Fraud_ind avg.365
1 A 01-Jan-15 1 NA
2 A 02-Jan-15 1 0.6666667
3 A 03-Jan-15 0 0.3333333
4 A 04-Jan-13 0 0.3333333
5 A 05-Jan-10 1 NA
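In your case you would set k = 365. Note that k is the number of observations in the window, not a number of days, so this matches a 365-day window only when there is exactly one row per day.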
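For transactions that are irregularly spaced, as in the question, a window measured in days rather than rows may be closer to what was asked. The following is a minimal base R sketch under stated assumptions, not the answerer's method: `trailing_rate()` is a hypothetical helper, the window is taken as the 365 calendar days ending on each transaction's date, and the dates are typed in ISO form to avoid locale-dependent month parsing.

```r
# Sample data from the question (single user A), dates written in ISO format
dat <- data.frame(
  User_ID   = "A",
  Tran_date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                        "2013-01-04", "2010-01-05")),
  Fraud_ind = c(1, 1, 0, 0, 1)
)
dat <- dat[order(dat$Tran_date), ]  # oldest transaction first

# trailing_rate() is a hypothetical helper, not part of zoo or dplyr.
# For each transaction it averages the fraud flags of all transactions
# dated within the `days` calendar days ending on that transaction's date.
trailing_rate <- function(dates, flags, days = 365) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates > dates[i] - days & dates <= dates[i]
    mean(flags[in_window])
  }, numeric(1))
}

dat$rate_365 <- trailing_rate(dat$Tran_date, dat$Fraud_ind)
dat
```

For the latest transaction (2015-01-03) this gives 2/3, which is the value the question expects. The loop compares every row with every other row, so it is only meant for small data sets.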
Answer 1 (score: 0)
It may be easier to use a simple, non-rolling, parameterized aggregation. Here is what I have in mind:
fraudRate <- function(df,endDate,lookbackDays) {
    endDate <- as.Date(endDate);
    startDate <- endDate-lookbackDays+1L;
    df <- subset(df,Tran_date>=startDate & Tran_date<=endDate);
    aggregate(Fraud_ind~User_ID,df,function(x) sum(x)/length(x));
}; ## end fraudRate()
You can call fraudRate() in a loop to compute the rate for different endDate / lookbackDays parameters.
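One way such a loop might look is sketched below. `fraudRate()` is repeated from this answer so the block runs on its own; the `toy` data frame, the `endDates` vector, and the extra `endDate` column in the result are all invented here for illustration.

```r
# fraudRate() as defined in this answer, repeated so the sketch is self-contained
fraudRate <- function(df, endDate, lookbackDays) {
  endDate <- as.Date(endDate)
  startDate <- endDate - lookbackDays + 1L
  df <- subset(df, Tran_date >= startDate & Tran_date <= endDate)
  aggregate(Fraud_ind ~ User_ID, df, function(x) sum(x) / length(x))
}

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  User_ID   = c("A", "A", "B", "B", "A"),
  Tran_date = as.Date(c("2014-03-01", "2014-09-15", "2014-11-20",
                        "2015-02-10", "2015-05-05")),
  Fraud_ind = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical end dates to evaluate
endDates <- as.Date(c("2014-12-31", "2015-06-30"))

# Call fraudRate() once per end date, tag each result with the date used,
# and stack the per-date results into one data frame
rates <- do.call(rbind, lapply(seq_along(endDates), function(i) {
  res <- fraudRate(toy, endDates[i], 365L)
  res$endDate <- endDates[i]
  res
}))
rates
```

Indexing with `seq_along()` keeps each end date as a Date value. If a window contains no transactions at all, `aggregate()` is expected to stop with an error, so a real loop would likely need a guard for that case.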
Demo:
## generate data
set.seed(1L);
NU <- 3L; ND <- 365L*2L; NT <- 15L; probFraud <- 1/3;
df <- data.frame(
User_ID=sample(LETTERS[1:3],NT,T),
Tran_date=sub('^0','',format(sort(sample(seq(as.Date('2014-01-01'),by=1L,len=ND),NT,T)),'%d-%b-%y')),
Fraud_ind=sample(c(1,0),NT,T,c(probFraud,1-probFraud))
);
## clean up data
df$Tran_date <- as.Date(df$Tran_date,'%d-%b-%y'); ## date column to R Date type
df$Fraud_ind <- df$Fraud_ind==1; ## fraud column to R logical type
df;
## User_ID Tran_date Fraud_ind
## 1 A 2014-01-10 FALSE
## 2 B 2014-04-02 FALSE
## 3 B 2014-06-04 FALSE
## 4 C 2014-07-15 FALSE
## 5 A 2014-09-06 TRUE
## 6 C 2014-10-05 TRUE
## 7 C 2014-10-07 TRUE
## 8 B 2014-10-09 FALSE
## 9 B 2014-12-30 TRUE
## 10 A 2015-04-21 FALSE
## 11 A 2015-06-08 TRUE
## 12 A 2015-07-22 FALSE
## 13 C 2015-09-27 TRUE
## 14 B 2015-11-14 FALSE
## 15 C 2015-12-26 FALSE
fraudRate(df,'2015-06-01',365L);
## User_ID Fraud_ind
## 1 A 0.5000000
## 2 B 0.3333333
## 3 C 0.6666667
Demo on your sample data:
df <- data.frame(User_ID=c('A','A','A','A','A'),Tran_date=c('1-Jan-15','2-Jan-15','3-Jan-15','4-Jan-13','5-Jan-10'),Fraud_ind=c(1L,1L,0L,0L,1L),stringsAsFactors=F);
df$Tran_date <- as.Date(df$Tran_date,'%d-%b-%y'); ## date column to R Date type
df$Fraud_ind <- df$Fraud_ind==1; ## fraud column to R logical type
df;
## User_ID Tran_date Fraud_ind
## 1 A 2015-01-01 TRUE
## 2 A 2015-01-02 TRUE
## 3 A 2015-01-03 FALSE
## 4 A 2013-01-04 FALSE
## 5 A 2010-01-05 TRUE
fraudRate(df,max(df$Tran_date),365L);
## User_ID Fraud_ind
## 1 A 0.6666667
Answer 2 (score: 0)
A solution similar to @bgoldst's:
# create numerical julian date for each transaction
dat$Tran_date <- as.Date(dat$Tran_date, "%d-%b-%y")
dat$jday<-as.numeric(dat$Tran_date)
# function to count number of frauds / total number of transactions in 365 days of x
fraud_fun<-function(x){
  frauds<-sum(dat[((x - dat$jday) <=365) & ((x - dat$jday) >=0), "Fraud_ind"])
  total <- nrow(dat[((x - dat$jday) <=365) & ((x - dat$jday) >=0),])
  frauds/total
}
dat$fraud_365<-sapply(dat$jday, fraud_fun)
User_ID Tran_date Fraud_ind jday fraud_365
1 A 2015-01-01 1 16436 1.0000000
2 A 2015-01-02 1 16437 1.0000000
3 A 2015-01-03 0 16438 0.6666667
4 A 2013-01-04 0 15709 0.0000000
5 A 2010-01-05 1 14614 1.0000000
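The function above looks at every row of dat regardless of User_ID, which is fine for the single-user sample. With several users, one possible adaptation is sketched below. It keeps the same window as the code above (a transaction counts if it is 0 to 365 days older than the current one); the two-user data frame is invented for illustration.

```r
# Hypothetical two-user data, invented for illustration
dat <- data.frame(
  User_ID   = c("A", "A", "A", "B", "B"),
  Tran_date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                        "2015-01-02", "2015-06-01")),
  Fraud_ind = c(1, 1, 0, 0, 1)
)

# For each row, average the fraud flags of the same user's transactions
# dated 0 to 365 days before (or on) that row's date
dat$fraud_365 <- vapply(seq_len(nrow(dat)), function(i) {
  age  <- as.numeric(dat$Tran_date[i] - dat$Tran_date)  # age in days
  keep <- dat$User_ID == dat$User_ID[i] & age >= 0 & age <= 365
  mean(dat$Fraud_ind[keep])
}, numeric(1))
dat
```

Each row is compared only with rows that share its User_ID, so user B's rates are unaffected by user A's transactions.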