我有一个35万行的大型data.frame。目前,我正在尝试在data.table中创建lag_close列的滚动平均值。以来
与dplyr
相比,我在data.table
上的语言更为流利,我尝试使用dplyr
创建滚动平均值,这给我带来了错误。
df = df %>% na.omit()%>% group_by(ticker) %>% mutate(avg10 = rollapplyr(lag_close, 10, mean))
错误:列
avg10
的长度必须为577(组大小)或一个而不是长度 567
我的下一个尝试是走上data.table路线,因为我看到人们成功了。 没有错误返回,但是输出非常奇怪并且不准确。
df = as.data.table(df)
df[, laggedRets := rollapplyr(lag_close, 10, mean, na.rm = T), by = ticker]
tail(df, 10)
ticker ret.adjusted.prices ret.closing.prices lag_close jump laggedRets
1: MELI 0.002731904 0.002731904 501.48 0.020174791 507.849
2: MELI 0.007258614 0.007258614 502.85 0.001966823 507.528
3: MELI -0.002606134 -0.002606134 506.50 -0.011437339 507.085
4: MELI 0.018132218 0.018132218 505.18 0.003832865 506.282
5: MELI -0.020628408 -0.020628408 514.34 -0.014373574 505.262
6: MELI 0.001568257 0.001568257 503.73 0.003428456 NA
7: MELI 0.018155879 0.018155879 504.52 -0.002957652 NA
8: MELI -0.022815765 -0.022815765 513.68 0.002663470 NA
9: MELI -0.007132015 -0.007132015 501.96 -0.005273231 NA
10: MELI -0.012801487 -0.012801487 498.38 -0.016467802 NA
head(df, 10)
ticker ret.adjusted.prices ret.closing.prices lag_close jump laggedRets
1: AA 0.0496011100 0.0496011100 28.83 0.006568736 NA
2: AA 0.0128883014 0.0128883014 30.26 -0.002647255 NA
3: AA 0.0009787928 0.0009787928 30.65 0.001304239 NA
4: AA -0.0391134289 -0.0391134289 30.68 0.006497716 NA
5: AA 0.0508819539 0.0508819539 29.48 0.011802531 31.154
6: AA 0.0319560684 0.0319560684 30.98 0.003222691 31.752
7: AA 0.0334689407 0.0334689407 31.97 0.031709630 32.268
8: AA -0.0009080811 -0.0009080811 33.04 -0.006680893 32.769
9: AA -0.0112086950 -0.0112086950 33.01 -0.001212398 33.330
10: AA 0.0664829064 0.0664829064 32.64 0.009452792 34.132
似乎data.table的两边都有NA,事实并非如此。而且数字也不准确。我不确定我缺少什么。 这是一个样本,所以你们可以测试一下。
df =structure(list(ticker = c("AA", "AA", "AA", "AA", "AA", "AA",
"AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA",
"AA", "AA", "AA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA",
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AABA",
"AABA", "AABA", "AABA", "AABA", "AABA", "AABA", "AAL", "AAL",
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL",
"AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL", "AAL",
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI",
"AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI", "AAOI",
"AAOI", "AAOI", "AAOI", "AAOI", "AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL",
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV",
"ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV", "ABBV",
"ABBV", "ABBV", "ABBV", "ABBV"), lag_close = c(28.83, 30.26,
30.65, 30.68, 29.48, 30.98, 31.969999, 33.040001, 33.009998,
32.639999, 34.810001, 35.419998, 35.66, 36.290001, 37.5, 36.529999,
36.299999, 36.669998, 36.549999, 36.450001, 38.900002, 40.060001,
41.34, 41.23, 41.34, 42.299999, 42.59, 42.110001, 42.27, 41.990002,
42.029999, 42.09, 42.049999, 42.400002, 43.900002, 44.939999,
44.549999, 44.419998, 43.93, 44.07, 46.299999, 46.700001, 45.889999,
46.209999, 47.080002, 48.48, 48.639999, 48.099998, 47.650002,
46.75, 47.639999, 47.259998, 48, 46.939999, 47.540001, 47.91,
49.59, 46.950001, 44.900002, 44.25, 23.1, 23.51, 22.809999, 22.42,
22.24, 22.870001, 22.719999, 29.860001, 28.219999, 30.440001,
30.84, 30.48, 30.58, 31.290001, 32, 31.709999, 31.280001, 31.15,
30.41, 30.76, 116.150002, 116.019997, 116.610001, 117.910004,
118.989998, 119.110001, 119.75, 119.25, 119.040001, 120, 119.989998,
119.779999, 120, 120.080002, 119.970001, 121.879997, 121.940002,
121.949997, 121.629997, 121.349998, 62.41, 63.290001, 63.77,
63.790001, 64.209999, 64.07, 61.139999, 61.279999, 61.990002,
61.860001, 61.66, 61.380001, 61.150002, 60.959999, 60.580002,
61.380001, 61.27, 60, 60.43, 61.110001)), row.names = c(NA, -120L
), class = "data.frame", .Names = c("ticker", "lag_close"))
答案 0 :(得分:2)
这里是Safari
方法,第1-9行为NA。
data.table()
此外,目前尚不清楚是否应在滚动平均值之前删除dt <- as.data.table(df)
dt[, laggedRets := frollmean(lag_close, 10, na.rm = T), by = ticker]
。如果是这样,您可以使用:
NA
答案 1 :(得分:1)
您可以使用map
包中的purrr
并将其应用于1:n()
:
df = df %>%
na.omit() %>%
group_by(ticker) %>%
mutate(avg10 = map_dbl(1:n(), ~mean(lag_close[(max(.x-9, 1)):.x], na.rm =T))
当然,您必须确定在少于9个观察值的前9行中应该如何处理。在我的解决方案中,第1至9行包含最后1至9个观察值的平均值。