I have RAM and performance problems in R. The sorted data.frame "input.sort", which has nearly 11 million rows, has the following structure:
CNr DY AY C R
1 19730000123 2010 2000 0 0
2 19730000123 2011 2000 0 0
3 19730000123 2012 2000 0 0
4 19730000123 2013 2000 0 0
5 19730000123 2010 1997 500 10000
6 19930100025 1993 1993 500 0
7 19930100025 1994 1993 500 0
8 19930100025 1995 1993 500 0
9 19930100025 1996 1993 500 0
10 19930100025 1997 1993 500 0
11 19930100025 1998 1993 500 0
12 19930100025 1999 1993 500 0
13 19930100025 2000 1993 500 0
14 19930100025 2001 1993 500 0
15 19930100025 2002 1993 500 0
16 19930100025 2003 1993 500 0
17 19930100025 2004 1993 500 0
18 19930100025 2005 1993 500 0
19 19930100025 2006 1993 500 0
20 19930100025 2007 1993 500 0
21 19930100025 2008 1993 500 0
22 19930100025 2009 1993 500 0
23 19930100025 2010 1993 500 0
24 19930100025 2011 1993 500 0
25 19930100025 2012 1993 500 0
26 19930100025 2013 1993 500 0
27 19930100029 1993 1993 5000 1000
28 19930100029 1994 1993 6000 1000
29 19930100029 1995 1993 6000 0
30 19930100029 1996 1993 6000 0
31 19930100029 1997 1993 6000 0
32 19930100029 1998 1993 6000 0
33 19930100029 1999 1993 6000 0
34 19930100029 2000 1993 6000 0
35 19930100029 2001 1993 6000 0
36 19930100029 2002 1993 6000 0
37 19930100029 2003 1993 6000 0
38 19930100029 2004 1993 6000 0
39 19930100029 2005 1993 6000 0
40 19930100029 2006 1993 6000 0
41 19930100029 2007 1993 6000 0
42 19930100029 2008 1993 6000 0
43 19930100029 2009 1993 6000 0
44 19930100029 2010 1993 6000 0
45 19930100029 2011 1993 6000 0
46 19930100029 2012 1993 6000 0
47 19930100029 2013 1993 6000 0
48 19930100035 1993 1993 3000 2000
49 19930100035 1994 1993 4000 2000
50 19930100035 1995 1993 5000 1000
51 19930100035 1996 1993 5000 0
52 19930100035 1997 1993 5000 0
53 19930100035 1998 1993 5000 0
54 19930100035 1999 1993 5000 0
55 19930100035 2000 1993 5000 0
56 19930100035 2001 1993 5000 0
57 19930100035 2002 1993 5000 0
58 19930100035 2003 1993 5000 0
59 19930100035 2004 1993 5000 0
60 19930100035 2005 1993 5000 0
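The data is grouped by CNr; there are 965,110 distinct CNr values.
My goal is to end up with three data sets: "PAID", "INCREMENTAL" and "RES".
Each row contains the information belonging to one CNr.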
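For example:
In "PAID", a row contains: CNr, AY, "the transpose of column C",
In "INCREMENTAL", a row contains: CNr, AY, "values calculated from C",
In "RES", a row contains: CNr, AY, "the transpose of column R".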
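To make the target layout concrete, here is a minimal sketch on made-up data. The data frame `toy` and the helper `to_wide` are hypothetical names used only for illustration, and the sketch assumes each (CNr, DY) pair occurs at most once and that DY is never smaller than AY. It fills the wide matrices with a single matrix-indexed assignment instead of a loop over claims:

```r
# Toy data in the same long layout as input.sort (values are made up).
toy <- data.frame(
  CNr = c(101, 101, 101, 202, 202),
  DY  = c(2011, 2012, 2013, 2012, 2013),
  AY  = c(2011, 2011, 2011, 2012, 2012),
  C   = c(100, 150, 150, 40, 70),
  R   = c(50, 0, 0, 30, 0)
)

k   <- max(toy$DY) - min(toy$AY) + 1   # number of development periods
ids <- unique(toy$CNr)                 # one output row per claim
row <- match(toy$CNr, ids)             # target row of each input row
col <- toy$DY - toy$AY + 1             # development period = target column

# Hypothetical helper: spread one long column over a claims-by-period matrix.
to_wide <- function(values) {
  m <- matrix(NA_real_, nrow = length(ids), ncol = k,
              dimnames = list(NULL, as.character(seq_len(k))))
  m[cbind(row, col)] <- values         # one vectorized assignment
  m
}

ay   <- toy$AY[match(ids, toy$CNr)]    # AY taken from each claim's first row
PAID <- cbind(CNr = ids, AY = ay, to_wide(toy$C))
RES  <- cbind(CNr = ids, AY = ay, to_wide(toy$R))
print(PAID)
```

Periods without a record stay `NA`, which corresponds to the `rep(NA, ...)` padding on the right-hand side of each row.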
To do this, I wrote the following R code:
# Analysis: min and max of accident year (AY) and development year (DY), number of development years k
dim(input.sort)
MinAY <- min(input.sort$AY)
MaxAY <- max(input.sort$AY)
MinDY<- min(input.sort$DY)
MaxDY<- max(input.sort$DY)
k <- MaxDY-MinAY+1
# Unique claim numbers
ID <- input.sort$CNr
ID.Unique <- unique(ID)
ID.VL <- rle(sort(ID)) # values and lengths of runs of equal values in ID
ID.Freq <- data.frame(number=ID.VL$values, n=ID.VL$lengths)
# Initialize data structures
PAID <- matrix(, nrow=nrow(ID.Freq), ncol=2+k)
colnames(PAID) <- c("LNr", "AY", 1:k)
INCREMENTAL <- matrix(, nrow=nrow(ID.Freq), ncol=2+k)
colnames(INCREMENTAL) <- c("LNr", "AY", 1:k)
RES <- matrix(, nrow=nrow(ID.Freq), ncol=2+k)
colnames(RES) <- c("LNr", "AY", 1:k)
# For each claim: accumulated claims amount, incremental claims amount, outstanding claims amount
# Control variable
r <- 1
system.time(for(i in 1:nrow(ID.Freq)) {
  # Select claim
  LNr <- ID.Freq[i,1]
  Loss <- input.sort[r:(r+ID.Freq[i,2]-1),]
  Loss.sort <- Loss[order(Loss$DY),]
  AY <- tail(Loss.sort$AY,1)
  Dev <- Loss.sort$DY-AY+1
  # Testing of incorrect data
  if (floor(LNr/1e7) < AY){r <- r+ID.Freq[i,2]; next}
  if (max(Loss.sort$DY) < MaxDY){r <- r+ID.Freq[i,2]; next}
  if (length(Dev) < MaxDY-AY+1){r <- r+ID.Freq[i,2]; next}
  # Accumulated claims amount(C0), incremental claims amount(S), outstanding claims amount (R)
  C0 <- if (min(Dev)==1) Loss.sort[,4] else c(rep(0,min(Dev)-1),Loss.sort[,4])
  C1 <- if (min(Dev)==1) c(0,C0[1:(max(Dev))-1]) else c(rep(0,min(Dev)), C0[(2+min(Dev)-1):(max(Dev))])
  S <- C0-C1
  R <- if (min(Dev)==1) Loss.sort[,5] else c(rep(0,min(Dev)-1),Loss.sort[,5])
  PAID[i,] <- t(c(LNr, AY, C0, rep(NA, k-(max(Dev)))))
  INCREMENTAL[i,] <- t(c(LNr, AY, S, rep(NA, k-(max(Dev)))))
  RES[i,] <- t(c(LNr, AY, R, rep(NA, k-(max(Dev)))))
  # Next loop
  r <- r+ID.Freq[i,2]
})
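For the common case where a claim starts in development period 1, the loop's `S <- C0-C1` is the first difference of the cumulative amounts within that claim. As a sketch only, assuming the rows are ordered by CNr and then by DY with no missing years (the data frame `toy` is made up), the same quantity can be computed for all claims at once without a loop:

```r
# Made-up cumulative amounts for two claims, already ordered by CNr and DY.
toy <- data.frame(
  CNr = c(1, 1, 1, 2, 2),
  DY  = c(2011, 2012, 2013, 2012, 2013),
  C   = c(100, 150, 150, 40, 70)
)

first <- !duplicated(toy$CNr)          # TRUE on the first row of each claim
S <- toy$C - c(0, head(toy$C, -1))     # difference to the previous row
S[first] <- toy$C[first]               # first period keeps its full amount
print(S)
```

The data checks from the loop (for example, skipping claims whose last DY is below `MaxDY`) are not reproduced here and would still have to be applied beforehand.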
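Running the program fails with an error: a vector of 169.4 MB cannot be allocated (Win 7, Intel i5, 2.67 GHz, 4 GB RAM), and the program stops. Only after splitting the work, for example with
ID.Freq <- ID.Freq[1:(length(ID.Unique)/2),]
does the program do what it is supposed to do.
Does anyone have an idea how I can improve the code?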
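As a rough plausibility check, the size in the error message appears to match one complete result matrix stored as doubles. The sketch below assumes MinAY = 1993 and MaxDY = 2013, as in the sample shown above; the real values in the full data may differ.

```r
n_claims <- 965110              # distinct CNr, as stated in the question
k        <- 2013 - 1993 + 1     # assumption: MinAY = 1993, MaxDY = 2013
n_cols   <- 2 + k               # claim number, AY, and k development columns
bytes    <- n_claims * n_cols * 8   # a double takes 8 bytes
mb       <- bytes / 1024^2
print(round(mb, 1))
```

If that assumption holds, one possible reading is that the failing allocation is a full copy of one result matrix: `matrix(, nrow=..., ncol=...)` creates a logical matrix, and the first numeric assignment into it forces a conversion to double. Three such matrices plus the 11-million-row data frame leave little headroom on a machine with 4 GB of RAM.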