Creating variables from a historical window in R

Time: 2016-05-13 07:27:53

Tags: r

I'm new to R and need help creating variables over a historical period.

Suppose I have the following data structure:

User_ID Tran_date   Fraud_ind
A       1-Jan-15    1
A       2-Jan-15    1
A       3-Jan-15    0
A       4-Jan-13    0
A       5-Jan-10    1

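I need to create the variable using a rolling window: for User_ID A, the fraud rate over the past 365 days. In this case the answer should be

(number of fraudulent transactions in the past 365 days) / (number of transactions in the past 365 days)

2/3 ≈ 66.67%

Please help me compute this in R.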

3 answers:

Answer 0 (score: 0)

You can use the rollmean function from zoo; just make sure your data is sorted by date first:

library(dplyr)
library(zoo)

TS_data <- read.csv("data.csv", stringsAsFactors = FALSE)

# Rolling mean of the fraud indicator for user A over k consecutive rows
Roll.Mean <- TS_data %>%
  filter(User_ID == "A") %>%
  mutate(
    avg.365 = rollmean(x = Fraud_ind,
                       k = 3,
                       fill = NA)
  )

>Roll.Mean

  User_ID Tran_date Fraud_ind   avg.365
1       A 01-Jan-15         1        NA
2       A 02-Jan-15         1 0.6666667
3       A 03-Jan-15         0 0.3333333
4       A 04-Jan-13         0 0.3333333
5       A 05-Jan-10         1        NA

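Obviously, in your case k would be k=365. Note that k counts rows rather than calendar days, so this matches a 365-day window only when there is exactly one transaction per day.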
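Because rollmean's k is a number of observations, irregularly spaced transactions need a window defined on the dates themselves. Here is a minimal base-R sketch of that idea for a single user. The helper name fraud_rate_by_date and its arguments are made up for illustration, and the window is assumed to cover the 365 days ending on each transaction's date, inclusive.

```r
# The question's five transactions for user A, with dates already parsed.
tran_date <- as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                       "2013-01-04", "2010-01-05"))
fraud_ind <- c(1, 1, 0, 0, 1)

# Hypothetical helper: for each transaction, the share of fraudulent
# transactions among those dated within the `days` days ending on its date.
fraud_rate_by_date <- function(dates, fraud, days = 365L) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates > dates[i] - days & dates <= dates[i]
    mean(fraud[in_window])
  }, numeric(1))
}

rate_365 <- fraud_rate_by_date(tran_date, fraud_ind)
rate_365
# 1, 1, 0.667, 0, 1 (the third value is the 2/3 from the question)
```

The comparison is done on dates, so the rows do not have to be sorted for this to work.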

Answer 1 (score: 0)

It may be easier to use a simple, non-rolling, parameterized aggregation. Here is what I have in mind:

fraudRate <- function(df,endDate,lookbackDays) {
    endDate <- as.Date(endDate);
    startDate <- endDate-lookbackDays+1L;
    df <- subset(df,Tran_date>=startDate & Tran_date<=endDate);
    aggregate(Fraud_ind~User_ID,df,function(x) sum(x)/length(x));
}; ## end fraudRate()

You can run a loop over fraudRate() to compute it for different endDate / lookbackDays parameters.
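Such a loop could look like the following sketch. It repeats fraudRate() so the snippet runs on its own; the data frame tx, the vector end_dates, and the result name rates are made up for illustration.

```r
# fraudRate() as defined in this answer, repeated so the snippet is standalone.
fraudRate <- function(df, endDate, lookbackDays) {
  endDate <- as.Date(endDate)
  startDate <- endDate - lookbackDays + 1L
  df <- subset(df, Tran_date >= startDate & Tran_date <= endDate)
  aggregate(Fraud_ind ~ User_ID, df, function(x) sum(x) / length(x))
}

# Illustrative data: the question's five transactions for user A.
tx <- data.frame(
  User_ID   = rep("A", 5),
  Tran_date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                        "2013-01-04", "2010-01-05")),
  Fraud_ind = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Hypothetical end dates. Each one is chosen so that its window contains
# at least one transaction, since aggregate() needs rows to work on.
end_dates <- as.Date(c("2013-06-30", "2015-01-02", "2015-01-03"))

# One fraudRate() call per end date, stacked into a single data frame.
rates <- do.call(rbind, lapply(seq_along(end_dates), function(i) {
  out <- fraudRate(tx, end_dates[i], 365L)
  out$endDate <- end_dates[i]
  out
}))
rates
```

Each row of rates holds one user's fraud rate for one end date, so varying lookbackDays as well would only need a second loop or an extra column.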

Demo:

## generate data
set.seed(1L);
NU <- 3L; ND <- 365L*2L; NT <- 15L; probFraud <- 1/3;
df <- data.frame(
    User_ID=sample(LETTERS[1:3],NT,T),
    Tran_date=sub('^0','',format(sort(sample(seq(as.Date('2014-01-01'),by=1L,len=ND),NT,T)),'%d-%b-%y')),
    Fraud_ind=sample(c(1,0),NT,T,c(probFraud,1-probFraud))
);
## clean up data
df$Tran_date <- as.Date(df$Tran_date,'%d-%b-%y'); ## date column to R Date type
df$Fraud_ind <- df$Fraud_ind==1; ## fraud column to R logical type
df;
##    User_ID  Tran_date Fraud_ind
## 1        A 2014-01-10     FALSE
## 2        B 2014-04-02     FALSE
## 3        B 2014-06-04     FALSE
## 4        C 2014-07-15     FALSE
## 5        A 2014-09-06      TRUE
## 6        C 2014-10-05      TRUE
## 7        C 2014-10-07      TRUE
## 8        B 2014-10-09     FALSE
## 9        B 2014-12-30      TRUE
## 10       A 2015-04-21     FALSE
## 11       A 2015-06-08      TRUE
## 12       A 2015-07-22     FALSE
## 13       C 2015-09-27      TRUE
## 14       B 2015-11-14     FALSE
## 15       C 2015-12-26     FALSE
fraudRate(df,'2015-06-01',365L);
##   User_ID Fraud_ind
## 1       A 0.5000000
## 2       B 0.3333333
## 3       C 0.6666667

Demo on your example data:

df <- data.frame(User_ID=c('A','A','A','A','A'),Tran_date=c('1-Jan-15','2-Jan-15','3-Jan-15','4-Jan-13','5-Jan-10'),Fraud_ind=c(1L,1L,0L,0L,1L),stringsAsFactors=F);
df$Tran_date <- as.Date(df$Tran_date,'%d-%b-%y'); ## date column to R Date type
df$Fraud_ind <- df$Fraud_ind==1; ## fraud column to R logical type
df;
##   User_ID  Tran_date Fraud_ind
## 1       A 2015-01-01      TRUE
## 2       A 2015-01-02      TRUE
## 3       A 2015-01-03     FALSE
## 4       A 2013-01-04     FALSE
## 5       A 2010-01-05      TRUE
fraudRate(df,max(df$Tran_date),365L);
##   User_ID Fraud_ind
## 1       A 0.6666667

Answer 2 (score: 0)

A solution similar to @bgoldst's:

# create numerical julian date for each transaction
dat$Tran_date <- as.Date(dat$Tran_date, "%d-%b-%y")
dat$jday<-as.numeric(dat$Tran_date)

# function to count number of frauds / total number of transactions in 365 days of x
fraud_fun<-function(x){
  frauds<-sum(dat[((x - dat$jday) <=365) & ((x - dat$jday) >=0), "Fraud_ind"])
  total <- nrow(dat[((x - dat$jday) <=365) & ((x - dat$jday) >=0),])
  frauds/total
} 


dat$fraud_365<-sapply(dat$jday, fraud_fun)
  User_ID  Tran_date Fraud_ind  jday fraud_365
1       A 2015-01-01         1 16436 1.0000000
2       A 2015-01-02         1 16437 1.0000000
3       A 2015-01-03         0 16438 0.6666667
4       A 2013-01-04         0 15709 0.0000000
5       A 2010-01-05         1 14614 1.0000000
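fraud_fun() reads the global dat and treats every row as belonging to the same user, which is fine for the single-user example. With several users, one possible extension is to pass the data in as an argument and apply the function per user. The sketch below assumes that approach; the name fraud_fun_by and the two-user data are invented for illustration.

```r
# Made-up data with two interleaved users.
dat <- data.frame(
  User_ID   = c("A", "B", "A", "A", "B"),
  Tran_date = as.Date(c("2015-01-01", "2015-01-02", "2015-01-02",
                        "2015-01-03", "2015-06-01")),
  Fraud_ind = c(1, 0, 1, 0, 1)
)
dat$jday <- as.numeric(dat$Tran_date)

# Same window test as fraud_fun(), but on a data frame passed as an argument.
fraud_fun_by <- function(d) {
  sapply(d$jday, function(x) {
    in_window <- (x - d$jday) <= 365 & (x - d$jday) >= 0
    sum(d$Fraud_ind[in_window]) / sum(in_window)
  })
}

# Compute per user; unsplit() puts the results back in the original row order.
dat$fraud_365 <- unsplit(lapply(split(dat, dat$User_ID), fraud_fun_by),
                         dat$User_ID)
dat$fraud_365
# 1, 0, 1, 0.667, 0.5
```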