Question

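I have a series of balances that a credit bureau reported over several months. I want to compute each consumer's exposure by reported month. I have about 2 million records to process, and I am looking for a solution in R.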

Input data:

df <- data.frame("id" = c(1,1)
,"reported_date_hist" = c("20170830,20170728,20170630",
                          "20170730,20170620,20170525")

,"cur_bal_hist" = c("12455,14085,16940",
                "0,1260,2467"))

Output:

  id         reported_date_hist      cur_bal_hist
1  1 20170830,20170728,20170630      12455,14085,16940
2  1 20170730,20170620,20170525      0,1260,2467

I would like output like the following:

df <- data.frame("id" = c(1,1)
            ,"c201708"=c(12455,0)
            ,"c201707"=c(14085,0)
            ,"c201706"=c(16940,1260)
            ,"c201705"=c(0,2467))

Output:

  id c201708 c201707 c201706 c201705
1  1   12455   14085   16940       0
2  1       0       0    1260    2467

After that, I plan to group the balances by month and take the maximum.

Any help would be appreciated.

Answer 1

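Here is an idea using the tidyverse. We split the strings and unnest the data frame into long format. We convert to datetime (as.POSIXct) and use format to keep only the year and month. We group by that, create a new variable that runs over the length of each group with seq (to avoid duplicate identifiers), and use spread to convert to wide format, i.e.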

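library(tidyverse)

df %>% 
 # split the comma-separated strings into list columns
 mutate(reported_date_hist = strsplit(as.character(reported_date_hist), ','), 
        cur_bal_hist = strsplit(as.character(cur_bal_hist), ',')) %>% 
 # one row per (date, balance) pair; tidyr >= 1.0 expects the columns to be named
 unnest(cols = c(reported_date_hist, cur_bal_hist)) %>% 
 # keep only year and month, e.g. "20170830" becomes "201708"
 mutate(reported_date_hist = format(as.POSIXct(reported_date_hist, format = '%Y%m%d'), 
                                    format = '%Y%m')) %>% 
 # number the entries within each month so that spread() gets unique row identifiers
 group_by(reported_date_hist) %>% 
 mutate(new = seq(n())) %>% 
 spread(reported_date_hist, cur_bal_hist)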

which gives,

# A tibble: 2 x 6
     id   new `201705` `201706` `201707` `201708`
* <dbl> <int>    <chr>    <chr>    <chr>    <chr>
1     1     1     2467    16940    14085    12455
2     1     2     <NA>     1260        0     <NA>

Note: you can append ... %>% select(-new) at the end to drop the variable new. If needed, rename can also be used to change the column names.
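The result above has one row per value of new rather than one row per input record. As a minimal sketch of a variant that keeps one row per original record and fills missing months with 0, as in the question's desired output, the following assumes tidyr 1.1.0 or later (for pivot_wider with a function values_fn and a scalar values_fill) and dplyr 1.0 or later (for across). The helper column row and the object names wide and monthly_max are illustrative choices, not part of the answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = c(1, 1),
  reported_date_hist = c("20170830,20170728,20170630",
                         "20170730,20170620,20170525"),
  cur_bal_hist = c("12455,14085,16940", "0,1260,2467"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # remember which input record each balance came from (illustrative helper)
  mutate(row = row_number()) %>%
  # split both history columns in parallel, one row per reported balance
  separate_rows(reported_date_hist, cur_bal_hist, sep = ",") %>%
  # keep the first six characters (year and month) and make balances numeric
  transmute(id, row,
            month = paste0("c", substr(reported_date_hist, 1, 6)),
            bal = as.numeric(cur_bal_hist)) %>%
  # one column per month; sum repeated months within a record, fill gaps with 0
  pivot_wider(names_from = month, values_from = bal,
              values_fn = sum, values_fill = 0)

# follow-up step from the question: maximum balance per id in each month
monthly_max <- wide %>%
  group_by(id) %>%
  summarise(across(starts_with("c"), max))
```

Taking the month from the first six characters avoids date parsing altogether, which may matter at 2 million records, but it assumes every date string is already in yyyymmdd form.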

Answer 2

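This worked for me (id has been replaced with los_app_id). 'data' is the data frame that holds the raw data. DFlong is created by splitting the strings and unlisting the results. A list of the last 36 months is used to filter out older reported balances. dcast from the reshape2 package gives a month-by-month view of the total balance for each los_app_id (using sum to get the total). Getting the maximum across those columns is then easy.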

library(reshape2)

# Split the date history once; each record contributes one row per reported date
dates <- strsplit(as.character(data$reported_date_hist), ',')
DFlong <- data.frame(los_app_id = rep.int(data$los_app_id, lengths(dates)),
                     yearMM = unlist(dates),
                     bal    = unlist(strsplit(as.character(data$cur_bal_hist), ',')))

# Strip stray spaces (the pattern is assumed to be a single space), then keep year and month
DFlong$yearMM <- gsub(" ", "", DFlong$yearMM)
DFlong$yearMM <- format(as.POSIXct(DFlong$yearMM, format = '%Y%m%d'), format = '%Y%m')

# Labels of the last 36 months, counted from the first of the current month
# so that month arithmetic never overflows; older balances are dropped
first_of_month <- as.Date(format(Sys.Date(), '%Y-%m-01'))
last36months <- format(seq(first_of_month, length.out = 36, by = "-1 month"), '%Y%m')
DFlong <- DFlong[DFlong$yearMM %in% last36months, ]

DFlong$bal <- as.numeric(gsub(" ", "", DFlong$bal))

# One row per los_app_id, one column per month, balances summed
DFwide <- dcast(DFlong, los_app_id ~ yearMM, value.var = "bal", fun.aggregate = sum, na.rm = TRUE)
DFwide$Maximum_Indebtedness <- apply(DFwide[-1], 1, max, na.rm = TRUE)

result <- DFwide[, c('los_app_id', 'Maximum_Indebtedness')]
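If only the maximum monthly total is needed, the wide table can be skipped. The following is a rough sketch in base R under that assumption: it sums balances per los_app_id and month in long format and then takes the maximum per los_app_id. The sample data frame and the names long and monthly are illustrative, and the 36-month filter is left out so that the 2017 sample data survives.

```r
# Sample data in the shape described in the question (id renamed to los_app_id)
data <- data.frame(
  los_app_id = c(1, 1),
  reported_date_hist = c("20170830,20170728,20170630",
                         "20170730,20170620,20170525"),
  cur_bal_hist = c("12455,14085,16940", "0,1260,2467"),
  stringsAsFactors = FALSE
)

# Split both history columns; each record yields one row per reported date
dates <- strsplit(as.character(data$reported_date_hist), ",", fixed = TRUE)
bals  <- strsplit(as.character(data$cur_bal_hist), ",", fixed = TRUE)

long <- data.frame(
  los_app_id = rep(data$los_app_id, lengths(dates)),
  yearMM     = substr(unlist(dates), 1, 6),   # year and month, e.g. "201708"
  bal        = as.numeric(unlist(bals))
)

# Total balance per los_app_id and month
monthly <- aggregate(bal ~ los_app_id + yearMM, data = long, FUN = sum)

# Largest monthly total per los_app_id
result <- aggregate(bal ~ los_app_id, data = monthly, FUN = max)
names(result)[2] <- "Maximum_Indebtedness"
```

For the sample data the largest monthly total is the one for 201706, where the two records add up to 16940 + 1260 = 18200.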

Maximum exposure of a customer: split reported balances by reported month and spread them dynamically