Question

I have a very large (nearly 6 million rows) data frame called DF with the following structure:

CodeContract    RelMonth    AmtPmt
A0001           10          0.00
A0001           11          15.00
A0002           12          4.55
A0003           4           0.00
...             ...         ...

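RelMonth is defined as the number of months since a specific static event associated with the CodeContract.

The data is already sorted by CodeContract and RelMonth. The data frame currently holds consecutive RelMonths; that is, for any given CodeContract, all RelMonths between its Min RelMonth and Max RelMonth are populated. For example, given RelMonth=5 and RelMonth=12, the data frame will contain RelMonths 5:12.

I would like to calculate another column called Mths_since_last_Pmt that counts how many RelMonths have passed for a given CodeContract since that CodeContract last had an AmtPmt > Amt_threshold.

It would look like this (assuming Amt_threshold=5):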

CodeContract    RelMonth    AmtPmt  Mths_since_last_Pmt
A0001           1           0.00    1
A0001           2           3.00    2
A0001           3           0.00    3
A0001           4           10.00   0
A0001           5           0.00    1
A0002           1           10.00   0
A0002           2           12.00   0
A0002           3           0.00    1
A0002           4           0.00    2

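I currently have a working solution using a for loop, but it only processes about 5,000 rows per second.

I am looking for a way to vectorize this calculation, possibly without even having to sort the data first or to have uninterrupted RelMonths.

All the vectorized solutions I have tried to develop (usually ddply with a call to seq_along) end up eating all my RAM (24GB). I am looking for a solution that runs in under 2GB of RAM. Maybe a solution in the form of a custom function?

Any idea how to make this work?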

Update for @Roland

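I found a slightly different dataset that makes the code below produce wrong output. The adjusted input is:

DF <- read.table(text="CodeContract    RelMonth    AmtPmt  Mths_since_last_Pmt
A0001           1           0.00    1
A0001           2           3.00    2
A0001           3           0.00    3
A0001           4           10.00   0
A0001           5           0.00    1
A0002           1           1.00   0
A0002           2           14.00   0
A0002           3           14.00    1
A0002           4           14.00    2",header=TRUE)

The corresponding output is:

#    CodeContract RelMonth AmtPmt Mths_since_last_Pmt Mths_since_last_Pmt2
# 1:        A0001        1      0                   1                    1
# 2:        A0001        2      3                   2                    2
# 3:        A0001        3      0                   3                    3
# 4:        A0001        4     10                   0                    0
# 5:        A0001        5      0                   1                    1
# 6:        A0002        1      1                   0                    1
# 7:        A0002        2     14                   0                    0
# 8:        A0002        3     14                   1                   -1
# 9:        A0002        4     14                   2                   -2

The negative numbers -1 and -2 in the last rows of Mths_since_last_Pmt2 are incorrect; they should both be 0, since the threshold is exceeded. It seems that when the first item of a subgroup (here the first row of A0002, whose AmtPmt was changed to 1.00) is below the threshold, that is enough to throw the algorithm off.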

Can this be made to work with an adjustment?

Answer 1

Try this:

DF <- read.table(text="CodeContract    RelMonth    AmtPmt  Mths_since_last_Pmt
A0001           1           0.00    1
A0001           2           3.00    2
A0001           3           0.00    3
A0001           4           10.00   0
A0001           5           0.00    1
A0002           1           10.00   0
A0002           2           12.00   0
A0002           3           0.00    1
A0002           4           0.00    2",header=TRUE)

library(data.table)

DT <- data.table(DF,key=c("CodeContract","RelMonth"))

trsh <- 5
DT[,Mths_since_last_Pmt2 := 
       cumsum(AmtPmt<=trsh)-cumsum(cumsum(AmtPmt<=trsh)*(AmtPmt>trsh)),
            by=CodeContract]

#    CodeContract RelMonth AmtPmt Mths_since_last_Pmt Mths_since_last_Pmt2
# 1:        A0001        1      0                   1                    1
# 2:        A0001        2      3                   2                    2
# 3:        A0001        3      0                   3                    3
# 4:        A0001        4     10                   0                    0
# 5:        A0001        5      0                   1                    1
# 6:        A0002        1     10                   0                    0
# 7:        A0002        2     12                   0                    0
# 8:        A0002        3      0                   1                    1
# 9:        A0002        4      0                   2                    2

Hopefully data.table's assignment by reference will keep you under your RAM limit.
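The adjusted input in the question's update shows where the double-cumsum formula breaks: whenever two or more above-threshold rows directly follow a below-threshold row, the running count is subtracted more than once and the result goes negative. One possible adjustment, offered only as a sketch and not as part of the original answer, is to track the row position of the most recent payment with cummax. The helper name months_since_last_pmt, the column Mths_since_last_Pmt3 and the frame DF2 are hypothetical, and the sketch assumes rows are sorted and RelMonths are consecutive within each contract.

```r
# Hypothetical helper (not from the answer above): number of rows since the
# last row with amt > trsh, within one contract.
months_since_last_pmt <- function(amt, trsh) {
  idx <- seq_along(amt)
  # Row position of the most recent payment so far (0 if none yet)
  last_paid <- cummax(idx * (amt > trsh))
  idx - last_paid
}

# Adjusted input from the question's update
DF2 <- data.frame(
  CodeContract = rep(c("A0001", "A0002"), times = c(5, 4)),
  RelMonth     = c(1:5, 1:4),
  AmtPmt       = c(0, 3, 0, 10, 0, 1, 14, 14, 14)
)
trsh <- 5

# Base R grouping with ave(); with data.table the same helper could be used as
# DT[, Mths_since_last_Pmt3 := months_since_last_pmt(AmtPmt, trsh), by = CodeContract]
DF2$Mths_since_last_Pmt3 <- ave(
  DF2$AmtPmt, DF2$CodeContract,
  FUN = function(a) months_since_last_pmt(a, trsh)
)
DF2$Mths_since_last_Pmt3
```

Before the first payment of a contract, the count is simply the row position within that contract, which matches the expected output in the question.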

Answer 2

OK, I managed to find someone on SO with a similar issue and was able to adapt the answer to my problem. Thanks to @sven-hohenstein.

The answer goes like this:

library(data.table)
DF <- as.data.table(DF)
trsh <- 5  # threshold, as in the question's example

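First, I create a threshold test vector that is 1 if AmtPmt does not exceed the threshold and 0 otherwise:

# 1 when no qualifying payment was made (AmtPmt <= trsh), 0 otherwise;
# assigned by reference so the column always has nrow(DF) entries
DF[, trsh_test := as.integer(AmtPmt <= trsh)]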

Second, use the ave function and seq_along:

DF[,Mths_since_last_Pmt2 := 
        trsh_test * ave(trsh_test, c(0L, cumsum(diff(trsh_test) != 0)), 
        FUN = seq_along) ,
        by=CodeContract]

You get the following output, which is correct:

CodeContract RelMonth AmtPmt Mths_since_last_Pmt trsh_test Mths_since_last_Pmt2
A0001        1      0                   1         1                    1
A0001        2      3                   2         1                    2
A0001        3      0                   3         1                    3
A0001        4     10                   0         0                    0
A0001        5      0                   1         1                    1
A0002        1      1                   0         1                    1
A0002        2     14                   0         0                    0
A0002        3     14                   1         0                    0
A0002        4     14                   2         0                    0
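To see what the ave/seq_along step is doing, it may help to run the same idea on a single contract. The sketch below numbers the rows inside each run of identical trsh_test values and then zeroes out the payment rows; it uses base R's rle and sequence as a stand-in for the ave call, and the function name count_within_runs is hypothetical.

```r
# Hypothetical helper: position of each element within its run of equal
# values, set to 0 wherever the flag itself is 0 (a payment row).
count_within_runs <- function(flag) {
  runs <- rle(flag)                # lengths of consecutive equal values
  flag * sequence(runs$lengths)    # 1, 2, ... restarting at each new run
}

# Contract A0001 with trsh = 5: AmtPmt 0, 3, 0, 10, 0
trsh_test <- c(1, 1, 1, 0, 1)
count_within_runs(trsh_test)
```

Because a payment row starts a new run, the counter restarts at 1 on the first below-threshold row that follows it.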

Multi-criteria increment with reset in R
