lapply instead of a for loop

Time: 2014-04-02 14:47:19

Tags: r lapply

I have the following huge data frame:

V1  V2  V3  V4
A   E   R   12
A   R   T   18
A   T   Y   44
A   Y   U   11
B   E   R   22
B   R   T   53
B   T   Y   11
B   Y   U   153 

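What I want to do is get the outliers of V4 for each pair (V1, V2).

This is easy to do with two for loops over the unique values of V1 and V2: on each iteration, subset the data frame, take the V4 vector for that subset, and find its outliers with any function from the outliers package. The problem is speed.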
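For reference, the two-loop approach described above might look roughly like the sketch below. The toy data, the `find_outliers` helper, and the `results` list are hypothetical names introduced for illustration, and base R's `boxplot.stats()` is used as a stand-in outlier rule so the sketch runs without extra packages; any function from the outliers package could be swapped in.

```r
# Toy data in the question's V1/V2/V4 layout (values are made up for illustration).
df <- data.frame(
  V1 = rep(c("A", "B"), each = 12),
  V2 = rep(rep(c("E", "R"), each = 6), times = 2),
  V4 = c(10, 11, 12, 13, 14, 100,   # A, E: 100 is far from the rest
         20, 21, 22, 23, 24, 25,    # A, R
         30, 31, 32, 33, 34, 35,    # B, E
         1, 40, 41, 42, 43, 44),    # B, R: 1 is far from the rest
  stringsAsFactors = FALSE
)

# Stand-in outlier rule (assumption): values beyond the boxplot whiskers.
find_outliers <- function(v) grDevices::boxplot.stats(v)$out

results <- list()
for (a in unique(df$V1)) {
  for (b in unique(df$V2)) {
    # Each iteration rescans the whole data frame to build the subset.
    v <- df$V4[df$V1 == a & df$V2 == b]
    if (length(v) == 0) next  # skip (V1, V2) pairs that never occur
    results[[paste(a, b, sep = ".")]] <- find_outliers(v)
  }
}
```

The repeated full-column comparison inside the inner loop is the likely bottleneck on a large data frame, since its cost grows with the number of groups times the number of rows.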

I have never used lapply; perhaps someone can show me how to use lapply in place of the for loops to do this efficiently.

1 answer:

Answer 0: (score: 1)

Here is a data.table solution:

For about 4.4 million rows (676 groups of 6,500 records each), it takes only 2 seconds, including data generation.

library(outliers)
library(data.table)

# Fake data generation and coercion to data.table:
# 26^3 letter combinations, stacked 250 times (4,394,000 rows)
d <- as.data.table(expand.grid(x=LETTERS, y=LETTERS, z=LETTERS))
d <- rbindlist(replicate(250, d, simplify=FALSE))

# Add random "value" column and a column to keep track of row numbers
d[, c('value', 'row'):=list(rnorm(nrow(d)), seq_len(nrow(d)))]

# > d
#          x y z      value     row
#       1: A A A -1.1712284       1
#       2: B A A  0.1818000       2
#       3: C A A -1.3959594       3
#       4: D A A -0.4778956       4
#       5: E A A -2.0426768       5
#      ---                         
# 4393996: V Z Z  0.4024398 4393996
# 4393997: W Z Z  0.9891237 4393997
# 4393998: X Z Z  1.2066572 4393998
# 4393999: Y Z Z  2.3023321 4393999
# 4394000: Z Z Z -0.8343059 4394000

# For each group (combination of x and y), perform the outlier test
outliers <- d[, chisq.out.test(value), list(x, y)]

# Add the row numbers for min and max numbers of each group
outliers <- merge(outliers, 
                  d[, list(min.ind=row[which.min(value)], 
                           max.ind=row[which.max(value)]), list(x, y)], 
                  by=c('x', 'y'))

# Create a new outlier column. If the p.value is >= 0.05, set outlier = NA;
# otherwise (p.value < 0.05), if the "alternative" column contains "lowest",
# set outlier = min.ind, else max.ind.
outliers[, outlier:=ifelse(p.value < 0.05, 
                  ifelse(grepl('lowest', alternative), min.ind, max.ind), 
                  NA)]

The output looks like this:

# > outliers
#      x y statistic                                  alternative      p.value                       method
#   1: A A  13.69290 highest value 3.70310786094858 is an outlier 2.152665e-04 chi-squared test for outlier
#   2: A B  11.99842 lowest value -3.47397308041372 is an outlier 5.324581e-04 chi-squared test for outlier
#   3: A C  12.41749 highest value 3.49833131757565 is an outlier 4.253310e-04 chi-squared test for outlier
#   4: A D  16.18416 lowest value -4.00696031141966 is an outlier 5.747273e-05 chi-squared test for outlier
#   5: A E  12.32196 lowest value -3.56650649267448 is an outlier 4.476613e-04 chi-squared test for outlier
#  ---                                                                                                     
# 672: Z V  11.66230 lowest value -3.43256736243089 is an outlier 6.377944e-04 chi-squared test for outlier
# 673: Z W  14.11816 highest value 3.75476979294983 is an outlier 1.716780e-04 chi-squared test for outlier
# 674: Z X  15.63605 highest value 3.93390421620766 is an outlier 7.677674e-05 chi-squared test for outlier
# 675: Z Y  17.05664 lowest value -4.12928000349912 is an outlier 3.628127e-05 chi-squared test for outlier
# 676: Z Z  14.44709 lowest value -3.82794835873449 is an outlier 1.441520e-04 chi-squared test for outlier
#      data.name min.ind max.ind outlier
#   1:     value 3609165 1191113 1191113
#   2:     value  105483 3476019  105483
#   3:     value 4153397 1375713 1375713
#   4:     value 3406443 2539135 3406443
#   5:     value   25117 2004445   25117
#  ---                                  
# 672:     value 1871740 2551796 1871740
# 673:     value 1003782 2158390 2158390
# 674:     value 1555424 1492556 1492556
# 675:     value 2071914 1344538 2071914
# 676:     value 2281500  426556 2281500

It's a bit clunky, but hey, it gets us there in the end.
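Since the question asks specifically about lapply, here is a minimal base-R sketch of that form, under stated assumptions: the toy data, `find_outliers`, `groups`, and `outlier_list` are hypothetical names introduced for illustration, and `boxplot.stats()` stands in for whichever outlier function from the outliers package is preferred. The idea is to partition V4 once with `split()` and then apply the outlier function to each piece.

```r
# Toy data in the question's V1/V2/V4 layout (values are made up for illustration).
df <- data.frame(
  V1 = rep(c("A", "B"), each = 12),
  V2 = rep(rep(c("E", "R"), each = 6), times = 2),
  V4 = c(10, 11, 12, 13, 14, 100,   # A, E: 100 is far from the rest
         20, 21, 22, 23, 24, 25,    # A, R
         30, 31, 32, 33, 34, 35,    # B, E
         1, 40, 41, 42, 43, 44),    # B, R: 1 is far from the rest
  stringsAsFactors = FALSE
)

# Stand-in outlier rule (assumption): values beyond the boxplot whiskers.
find_outliers <- function(v) grDevices::boxplot.stats(v)$out

# Partition V4 by every (V1, V2) pair in a single pass;
# drop = TRUE discards combinations that never occur in the data.
groups <- split(df$V4, list(df$V1, df$V2), drop = TRUE)

# Apply the outlier function to each group; the result is a named list
# with one element per group, named "V1.V2" (for example "A.E").
outlier_list <- lapply(groups, find_outliers)
```

Note that lapply is not inherently faster than a for loop in R. Any gain here comes from `split()` partitioning the data once rather than re-subsetting the full data frame on every iteration, which is also, broadly, why the grouped data.table call above scales well.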