I have the following huge data frame:
V1 V2 V3 V4
A E R 12
A R T 18
A T Y 44
A Y U 11
B E R 22
B R T 53
B T Y 11
B Y U 153
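What I want to do is find the outliers of V4 for each (V1, V2) pair.
This is easy to handle with two for loops over the unique values of V1 and V2: on each round, subset the data, take the vector of V4 for that subset, and find its outliers with any function from the outliers package. The problem is speed.
I have never used lapply; maybe someone can guide me on using lapply instead of the for loops to do this efficiently.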
Answer 0 (score: 1)
Here is a data.table solution:
For close to 4.5 million rows (676 groups of 6,500 records each), it takes only about 2 seconds, including data generation.
library(outliers)
library(data.table)
# Fake data generation and coercion to data.table
d <- as.data.table(expand.grid(x=LETTERS, y=LETTERS, z=LETTERS))
d <- do.call(rbind, replicate(250, d, FALSE))
# Add random "value" column and a column to keep track of row numbers
d[, c('value', 'row'):=list(rnorm(nrow(d)), seq_len(nrow(d)))]
# > d
#          x y z      value     row
#       1: A A A -1.1712284       1
#       2: B A A  0.1818000       2
#       3: C A A -1.3959594       3
#       4: D A A -0.4778956       4
#       5: E A A -2.0426768       5
#      ---
# 4393996: V Z Z  0.4024398 4393996
# 4393997: W Z Z  0.9891237 4393997
# 4393998: X Z Z  1.2066572 4393998
# 4393999: Y Z Z  2.3023321 4393999
# 4394000: Z Z Z -0.8343059 4394000
# For each group (combination of x and y), perform the outlier test
outliers <- d[, chisq.out.test(value), list(x, y)]
# Add the row numbers for min and max numbers of each group
outliers <- merge(outliers,
                  d[, list(min.ind=row[which.min(value)],
                           max.ind=row[which.max(value)]), list(x, y)],
                  by=c('x', 'y'))
# Create a new outlier column. If p.value >= 0.05, set outlier = NA.
# Otherwise (p.value < 0.05), if the "alternative" column contains "lowest",
# set outlier = min.ind, else max.ind.
outliers[, outlier:=ifelse(p.value < 0.05,
                           ifelse(grepl('lowest', alternative), min.ind, max.ind),
                           NA)]
The output looks like this:
# > outliers
# x y statistic alternative p.value method
# 1: A A 13.69290 highest value 3.70310786094858 is an outlier 2.152665e-04 chi-squared test for outlier
# 2: A B 11.99842 lowest value -3.47397308041372 is an outlier 5.324581e-04 chi-squared test for outlier
# 3: A C 12.41749 highest value 3.49833131757565 is an outlier 4.253310e-04 chi-squared test for outlier
# 4: A D 16.18416 lowest value -4.00696031141966 is an outlier 5.747273e-05 chi-squared test for outlier
# 5: A E 12.32196 lowest value -3.56650649267448 is an outlier 4.476613e-04 chi-squared test for outlier
# ---
# 672: Z V 11.66230 lowest value -3.43256736243089 is an outlier 6.377944e-04 chi-squared test for outlier
# 673: Z W 14.11816 highest value 3.75476979294983 is an outlier 1.716780e-04 chi-squared test for outlier
# 674: Z X 15.63605 highest value 3.93390421620766 is an outlier 7.677674e-05 chi-squared test for outlier
# 675: Z Y 17.05664 lowest value -4.12928000349912 is an outlier 3.628127e-05 chi-squared test for outlier
# 676: Z Z 14.44709 lowest value -3.82794835873449 is an outlier 1.441520e-04 chi-squared test for outlier
# data.name min.ind max.ind outlier
# 1: value 3609165 1191113 1191113
# 2: value 105483 3476019 105483
# 3: value 4153397 1375713 1375713
# 4: value 3406443 2539135 3406443
# 5: value 25117 2004445 25117
# ---
# 672: value 1871740 2551796 1871740
# 673: value 1003782 2158390 2158390
# 674: value 1555424 1492556 1492556
# 675: value 2071914 1344538 2071914
# 676: value 2281500 426556 2281500
It's a bit clunky, but hey, it gets us there in the end.
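Since the question asked about lapply specifically, the same per-group test can also be sketched in base R with split() plus lapply(). This is only a minimal sketch and not a benchmarked alternative to the data.table version. The toy data frame below is made up for illustration and only borrows the question's column names (V1, V2, V4); the names groups, tests, and result are hypothetical.

```r
library(outliers)

# Toy data using the question's column names; the values are invented
set.seed(1)
df <- data.frame(
  V1 = rep(c("A", "B"), each = 20),
  V2 = rep(rep(c("E", "R"), each = 10), times = 2),
  V4 = rnorm(40)
)

# One numeric vector of V4 per (V1, V2) pair;
# drop = TRUE discards combinations that never occur in the data
groups <- split(df$V4, list(df$V1, df$V2), drop = TRUE)

# Run the chi-squared outlier test on each group's vector
tests <- lapply(groups, chisq.out.test)

# Collect the p-value and the most extreme value of each group
result <- data.frame(
  group   = names(tests),
  p.value = vapply(tests, function(t) t$p.value, numeric(1)),
  extreme = vapply(groups, function(v) outlier(v), numeric(1))
)
```

Here result has one row per (V1, V2) pair. Any other function from the outliers package that accepts a numeric vector could be swapped in for chisq.out.test in the lapply() call.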