R: performance, RAM-optimized code, speed-up

Asked: 2015-01-30 12:29:42

Tags: r performance ram

I have RAM and performance problems in R. The sorted data.frame "input.sort", with nearly 11 million rows, has the following structure:

           CNr   DY   AY    C     R
1  19730000123 2010 2000    0     0
2  19730000123 2011 2000    0     0
3  19730000123 2012 2000    0     0
4  19730000123 2013 2000    0     0
5  19730000123 2010 1997  500 10000
6  19930100025 1993 1993  500     0
7  19930100025 1994 1993  500     0
8  19930100025 1995 1993  500     0
9  19930100025 1996 1993  500     0
10 19930100025 1997 1993  500     0
11 19930100025 1998 1993  500     0
12 19930100025 1999 1993  500     0
13 19930100025 2000 1993  500     0
14 19930100025 2001 1993  500     0
15 19930100025 2002 1993  500     0
16 19930100025 2003 1993  500     0
17 19930100025 2004 1993  500     0
18 19930100025 2005 1993  500     0
19 19930100025 2006 1993  500     0
20 19930100025 2007 1993  500     0
21 19930100025 2008 1993  500     0
22 19930100025 2009 1993  500     0
23 19930100025 2010 1993  500     0
24 19930100025 2011 1993  500     0
25 19930100025 2012 1993  500     0
26 19930100025 2013 1993  500     0
27 19930100029 1993 1993 5000  1000
28 19930100029 1994 1993 6000  1000
29 19930100029 1995 1993 6000     0
30 19930100029 1996 1993 6000     0
31 19930100029 1997 1993 6000     0
32 19930100029 1998 1993 6000     0
33 19930100029 1999 1993 6000     0
34 19930100029 2000 1993 6000     0
35 19930100029 2001 1993 6000     0
36 19930100029 2002 1993 6000     0
37 19930100029 2003 1993 6000     0
38 19930100029 2004 1993 6000     0
39 19930100029 2005 1993 6000     0
40 19930100029 2006 1993 6000     0
41 19930100029 2007 1993 6000     0
42 19930100029 2008 1993 6000     0
43 19930100029 2009 1993 6000     0
44 19930100029 2010 1993 6000     0
45 19930100029 2011 1993 6000     0
46 19930100029 2012 1993 6000     0
47 19930100029 2013 1993 6000     0
48 19930100035 1993 1993 3000  2000
49 19930100035 1994 1993 4000  2000
50 19930100035 1995 1993 5000  1000
51 19930100035 1996 1993 5000     0
52 19930100035 1997 1993 5000     0
53 19930100035 1998 1993 5000     0
54 19930100035 1999 1993 5000     0
55 19930100035 2000 1993 5000     0
56 19930100035 2001 1993 5000     0
57 19930100035 2002 1993 5000     0
58 19930100035 2003 1993 5000     0
59 19930100035 2004 1993 5000     0
60 19930100035 2005 1993 5000     0

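The data is grouped by CNr, and there are 965,110 distinct CNr values. My goal is to end up with three datasets, "PAID", "INCREMENTAL" and "RES". Each row contains the information belonging to one CNr.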

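For example:

In "PAID", a row contains: CNr, AY, "the transposed C column". In "INCREMENTAL", a row contains: CNr, AY, "the values computed from C". In "RES", a row contains: CNr, AY, "the transposed R column".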
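To make the target layout concrete, here is a minimal sketch for a single claim. The data frame `toy` and the three `*_row` names are illustrative only (they are not part of the code below), and the sketch assumes the claim has one row per consecutive DY, starting in its accident year.

```r
# Toy input: the first three rows of claim 19930100029 from the excerpt above.
toy <- data.frame(
  CNr = c(19930100029, 19930100029, 19930100029),
  DY  = c(1993, 1994, 1995),
  AY  = c(1993, 1993, 1993),
  C   = c(5000, 6000, 6000),
  R   = c(1000, 1000, 0)
)

# Development period: 1 in the accident year, 2 in the following year, ...
dev <- toy$DY - toy$AY + 1

# One wide row per target dataset: CNr, AY, then one value per development period.
paid_row        <- c(CNr = toy$CNr[1], AY = toy$AY[1], setNames(toy$C, dev))
incremental_row <- c(CNr = toy$CNr[1], AY = toy$AY[1],
                     setNames(diff(c(0, toy$C)), dev))  # C minus lagged C
res_row         <- c(CNr = toy$CNr[1], AY = toy$AY[1], setNames(toy$R, dev))

print(rbind(paid_row, incremental_row, res_row))
```

Here the incremental amounts are the differences between successive accumulated amounts, with 0 taken as the amount before the first period.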

To do this, I wrote the following R code:

# Analysis: min and max of accident year (AY) and development year (DY), number of development periods k
dim(input.sort)
MinAY <- min(input.sort$AY)
MaxAY <- max(input.sort$AY)
MinDY<- min(input.sort$DY)
MaxDY<- max(input.sort$DY)
k <- MaxDY-MinAY+1

# Unique claim numbers
ID <- input.sort$CNr
ID.Unique <- unique(ID)
ID.VL <- rle(sort(ID)) #value and lengths of runs of equal values in ID
ID.Freq <- data.frame(number=ID.VL$values, n=ID.VL$lengths)

# Initialize data structures
PAID <- matrix(, nrow=nrow(ID.Freq), ncol=2+k)
colnames(PAID) <- c("LNr", "AY", 1:k)
INCREMENTAL <- matrix(, nrow=nrow(ID.Freq), ncol=2+k)
colnames(INCREMENTAL) <- c("LNr", "AY", 1:k)
RES <- matrix(, nrow=nrow(ID.Freq), ncol=2+k)
colnames(RES) <- c("LNr", "AY", 1:k)

# For each claim:  accumulated claims amount, incremental claims amount, outstanding claims amount
# Control variable
r <- 1 
system.time(for(i in 1:nrow(ID.Freq))  { 
    # Select claim  
    LNr <- ID.Freq[i,1]
    Loss <- input.sort[r:(r+ID.Freq[i,2]-1),]
    Loss.sort <- Loss[order(Loss$DY),]
    AY <- tail(Loss.sort$AY,1)
    Dev <- Loss.sort$DY-AY+1

    # Testing of incorrect data
    if (floor(LNr/1e7) < AY){r <- r+ID.Freq[i,2]; next}
    if (max(Loss.sort$DY) < MaxDY){r <- r+ID.Freq[i,2]; next}
    if (length(Dev) < MaxDY-AY+1){r <- r+ID.Freq[i,2]; next}

    # Accumulated claims amount(C0), incremental claims amount(S), outstanding claims amount (R)
    C0 <- if (min(Dev)==1) Loss.sort[,4] else c(rep(0,min(Dev)-1),Loss.sort[,4])
    C1 <- c(0, C0[-length(C0)])  # C0 lagged by one development period, so that S is the increment
    S <- C0-C1
    R <- if (min(Dev)==1) Loss.sort[,5] else c(rep(0,min(Dev)-1),Loss.sort[,5])
    PAID[i,] <- t(c(LNr, AY, C0, rep(NA, k-(max(Dev)))))
    INCREMENTAL[i,] <- t(c(LNr, AY, S, rep(NA, k-(max(Dev)))))
    RES[i,] <- t(c(LNr, AY, R, rep(NA, k-(max(Dev)))))

    # Next loop
    r <- r+ID.Freq[i,2]
})

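When I run the program, I get the error that a vector of 169.4 MB cannot be allocated (Win 7, Intel i5, 2.67 GHz, 4 GB RAM), and the program stops. It only runs if I split the data, for example: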

ID.Freq <- ID.Freq[1:(length(ID.Unique)/2),]

With this split, the program does what it should do.
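As a rough check, the size in the error message is consistent with a single result matrix stored as doubles. The sketch below assumes MinAY = 1993 and MaxDY = 2013 (the range visible in the excerpt, not confirmed for the full data), which gives k = 21. It also shows that `matrix(, nrow, ncol)` creates a logical matrix, so the first numeric assignment into it has to build a full double copy.

```r
n_claims <- 965110             # number of distinct CNr stated above
k        <- 2013 - 1993 + 1    # assumption: MaxDY = 2013, MinAY = 1993
bytes    <- n_claims * (2 + k) * 8   # 8 bytes per double
mb       <- bytes / 1024^2
round(mb, 1)                   # 169.4

# matrix() with no data is filled with logical NA ...
m_logical <- matrix(, nrow = 3, ncol = 2)
# ... whereas NA_real_ allocates the double storage once, up front.
m_double  <- matrix(NA_real_, nrow = 3, ncol = 2)
typeof(m_logical)              # "logical"
typeof(m_double)               # "double"
```

If that assumption holds, each of PAID, INCREMENTAL and RES needs about this much contiguous memory once it holds numbers, in addition to the input data itself.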

Does anyone have an idea how I could improve the code?
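One direction that might be worth trying is to replace the per-claim loop with a single matrix-indexed assignment. This is only a sketch under stated assumptions: `fill_wide` is a hypothetical helper, it skips the three data checks from the loop, and it returns CNr and AY as separate vectors instead of storing them in the matrix. Like the loop, it takes the AY of each claim's latest DY row, sets periods before the first observation to 0, and leaves periods after the last observation as NA.

```r
# Hypothetical helper (not part of the code above): builds the wide
# claims-by-development-period matrix for one value column.
fill_wide <- function(input, value_col) {
  k   <- max(input$DY) - min(input$AY) + 1
  ord <- order(input$CNr, input$DY)           # ties keep their input order
  cnr <- input$CNr[ord]
  dy  <- input$DY[ord]
  val <- input[[value_col]][ord]

  first <- !duplicated(cnr)                   # earliest DY row of each claim
  last  <- !duplicated(cnr, fromLast = TRUE)  # latest DY row of each claim
  ids   <- cnr[first]
  ay    <- input$AY[ord][last]                # same choice as tail(Loss.sort$AY, 1)
  row   <- match(cnr, ids)                    # target row of every input line
  dev   <- dy - ay[row] + 1                   # target column of every input line

  out <- matrix(NA_real_, nrow = length(ids), ncol = k,
                dimnames = list(NULL, as.character(seq_len(k))))

  # Periods before a claim's first observation are 0, as in the loop.
  min_dev <- dev[first]
  for (j in seq_len(k)) out[min_dev > j, j] <- 0

  # One vectorised assignment through a two-column (row, column) index matrix.
  ok <- dev >= 1 & dev <= k
  out[cbind(row[ok], dev[ok])] <- val[ok]

  list(CNr = ids, AY = ay, values = out)
}

# Usage on a toy input with two claims.
toy <- data.frame(
  CNr = c(1, 1, 1, 2),
  DY  = c(1993, 1994, 1995, 1995),
  AY  = c(1993, 1993, 1993, 1994),
  C   = c(5000, 6000, 6000, 700)
)
paid <- fill_wide(toy, "C")
print(paid$values)
```

The only remaining loop runs over the k development periods rather than over the claims. The incremental amounts could then be derived from `paid$values` by differencing adjacent columns, and the data checks would have to be applied as a separate row filter.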

0 answers:

No answers