Question

I need a good way to work with a large data frame and run calculations by group.

Here is my sample data:

df_names <- data.frame()
for (int_x in 1:1000) {
  df_names <- rbind(df_names,
                    sample(c("John", "Peter", "Michael", "Lisa", "George", "Linda",
                             "Fresco", "Pope", "Niclas", "Rammen"), 1))
}

df <- data.frame(date = Sys.Date() + sort(sample(1:5000, 1000)),
                 score = runif(1000, min = 25, max = 500),
                 names = df_names)

Now I want to find the rolling standard deviation for each name (ordered by date). I am doing it like this:

library(dplyr)
library(zoo)  # rollapplyr

unique_names <- unique(df$names)
for (int_y in 1:NROW(unique_names)) {
  df %>%
    filter(names == unique_names[int_y]) %>%
    arrange(date) %>%
    select(score) %>%
    as.matrix() %>%
    rollapplyr(5, sd, fill = 0)
}

Question 1: How do I get the matrix back into the original data frame? Question 2: I think this is bad practice. Is there a tidy way to do it?

Answer 1

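You are calculating the rolling sd but immediately discarding it, which we can fix. I also think we can improve the process to make full use of tidyverse grouping and keep everything together.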

set.seed(42)
df_names <- data.frame()
for (int_x in 1:1000) {
  df_names <- rbind(df_names,
                    name = sample(c("John", "Peter", "Michael", "Lisa", "George", "Linda",
                                    "Fresco", "Pope", "Niclas", "Rammen"), 1))
}
df <- data.frame(date = Sys.Date() + sort(sample(1:5000, 1000)),
                 score = runif(1000, min = 25, max=500),
                 name = df_names[[1]])
str(df)
# 'data.frame': 1000 obs. of  3 variables:
#  $ date : Date, format: "2020-10-02" "2020-10-03" "2020-10-08" "2020-10-15" ...
#  $ score: num  285.1 343 41.7 78.7 351.7 ...
#  $ name : chr  "John" "George" "John" "Niclas" ...
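
As an aside, growing a data frame with rbind() inside a loop copies it on every iteration. A minimal sketch of an alternative, assuming only base R: sample() can draw all 1000 names in one call with replace = TRUE. The vector name `people` is introduced here for illustration, and the random draws are not guaranteed to match the loop's output even with the same seed.

```r
set.seed(42)

# Hypothetical helper vector holding the candidate names
people <- c("John", "Peter", "Michael", "Lisa", "George", "Linda",
            "Fresco", "Pope", "Niclas", "Rammen")

# Build all three columns in one step, with no loop and no rbind()
df <- data.frame(date  = Sys.Date() + sort(sample(1:5000, 1000)),
                 score = runif(1000, min = 25, max = 500),
                 name  = sample(people, 1000, replace = TRUE),
                 stringsAsFactors = FALSE)
str(df)
```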

The process:

library(dplyr)
# library(zoo) # rollapply
as_tibble(df) %>%
  arrange(date) %>%
  group_by(name) %>%
  mutate(rollsd = zoo::rollapply(score, 5, sd, fill = 0)) %>%
  ungroup() %>%
  slice(100:120) # arbitrary, just to show the middle
# # A tibble: 21 x 4
#    date       score name   rollsd
#    <date>     <dbl> <chr>   <dbl>
#  1 2022-02-22 365.  Niclas   74.7
#  2 2022-02-26 388.  George  107. 
#  3 2022-02-27 275.  Niclas   74.7
#  4 2022-03-05 171.  Rammen  158. 
#  5 2022-03-07 500.  Pope    150. 
#  6 2022-03-08 278.  Fresco   81.2
#  7 2022-03-11  37.7 Linda   204. 
#  8 2022-03-12  46.9 John    180. 
#  9 2022-03-13 224.  George  110. 
# 10 2022-03-16 302.  Niclas   63.4
# # ... with 11 more rows
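
One detail worth checking: the question uses rollapplyr (right-aligned window), while the pipeline above uses rollapply, whose default alignment is centered. The sketch below compares the two on a small made-up data frame (`toy` and its values are hypothetical, and a window of 3 is used to keep it short). It also uses fill = NA instead of 0, which is an assumption about what is wanted for incomplete windows.

```r
library(dplyr)

# Hypothetical toy data: two names interleaved, one row per day
toy <- data.frame(
  date  = as.Date("2020-10-01") + 0:11,
  score = c(3, 10, 5, 40, 4, 20, 8, 90, 1, 30, 7, 60),
  name  = rep(c("John", "Lisa"), times = 6),
  stringsAsFactors = FALSE
)

res <- toy %>%
  arrange(date) %>%
  group_by(name) %>%
  mutate(
    # window centered on the current row (previous, current, next)
    sd_center = zoo::rollapply(score, 3, sd, fill = NA),
    # window ending at the current row (current and the 2 before it)
    sd_right  = zoo::rollapplyr(score, 3, sd, fill = NA)
  ) %>%
  ungroup()

print(res)
```

With the right-aligned version, each row only depends on scores dated on or before it, which is usually what is meant by a rolling statistic over time.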

Answer 2

With data.table:
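
A minimal sketch of how this might look, not necessarily the code this answer originally contained. It assumes data of the same shape as in Answer 1 (columns `date`, `score`, `name`), rebuilt here so the block runs on its own, and reuses zoo's rollapplyr for the window function. The names `people` and `dt` are introduced for illustration.

```r
library(data.table)

set.seed(42)
# Hypothetical stand-in for the sample data, drawn in one vectorized call
people <- c("John", "Peter", "Michael", "Lisa", "George", "Linda",
            "Fresco", "Pope", "Niclas", "Rammen")
dt <- data.table(date  = Sys.Date() + sort(sample(1:5000, 1000)),
                 score = runif(1000, min = 25, max = 500),
                 name  = sample(people, 1000, replace = TRUE))

# Sort by reference so the rows within each name are in date order
setorder(dt, date)

# Add the rolling sd as a new column, computed separately for each name
dt[, rollsd := zoo::rollapplyr(score, 5, sd, fill = NA), by = name]

head(dt)
```

Because := adds the column by reference, the result stays in the original table, which addresses the question of getting the values back next to the data.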


Created on 2020-09-30 by the reprex package (v0.3.0)

Calculating rolling standard deviation in a grouped data frame