Question

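I have a dataset in which measurements are taken at sensors (1-16) a variable number of times, after which the cycle repeats. I want the mean of value for each sensor within each sequence. Not every sequence goes straight from 16 back to 1 (sometimes a stray measurement has to be removed). Note: this is a small fake dataset.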

dataset (it can also be read with the script below)

# To read with rio  
# library("devtools")
# install_github("leeper/rio")
library("rio")
df <- import("https://gist.githubusercontent.com/karthik/ad2874e5b5c5f3af73ad89d14b26a913/raw/f435317539bc56a09b248a0ef193db21b7176eee/small.csv")

My first attempt:

library(dplyr)
# Assigning groups to the data
df$diff <- c(df$sensor[2:nrow(df)], 0) - df$sensor
# There is sometimes a sensor reading between 16 and 1. This removes those rows.
df2 <- df[!(df$diff < 0 & df$sensor != 16), ]

# end is now where the last 16 was
end <- which(df2$diff < 0)
# Start begins with 1, then adds 1 to the position of every last 16 sensor
# reading to get the next 1
start <- c(1, head(end, -1) + 1)
# Now combine both into a data.frame
positions <- data_frame(start, end)
# Add unique groups
positions$group <- 1:nrow(positions)
df2$group <- NA

# Yes this is a horrible loop and 
# super inefficient on the full dataset.
for (i in 1:nrow(positions)) {
  df2[positions[i,]$start:positions[i, ]$end, ]$group <-
    positions[i,]$group
}

Now it is easy to aggregate with dplyr:

df3 <- df2 %>% 
  group_by(sensor,group) %>% 
  summarise(mean_value = mean(value))
head(df3)

This gives me what I want:

  Source: local data frame [6 x 3]
  Groups: sensor [4]

  sensor group mean_value
  (int) (int)      (dbl)
  1      1     2 0.07285933
  2      2     2 0.06993007
  3      3     1 0.04845651
  4      3     2 0.03976837
  5      4     1 0.06033732
  6      4     2 0.06480888

Is there a better way to do this?

Answer 1

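Instead of creating a positions data frame, creating an intermediate df2 data frame, and using a for loop to add the grouping variable, you can do it all with dplyr. By combining cumsum and lag, you can add the grouping variable with mutate. This gives a much simpler procedure: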

df %>% 
  mutate(differ = lead(sensor) - sensor) %>% 
  filter(!(differ < 0 & sensor != 16)) %>% 
  mutate(grp = cumsum(lag(differ,default=0) < 0) + 1) %>% 
  group_by(sensor, grp) %>% 
  summarise(mean_val = mean(value))

which gives:

Source: local data frame [30 x 3]
Groups: sensor [?]

   sensor   grp   mean_val
    (int) (dbl)      (dbl)
1       1     2 0.07285933
2       2     2 0.06993007
3       3     1 0.04845651
4       3     2 0.03976837
5       4     1 0.06033732
6       4     2 0.06480888
7       5     1 0.03276722
8       5     2 0.05005240
9       6     1 0.06967405
10      6     2 0.06484712
..    ...   ...        ...

Note: I used differ as the variable name instead of diff, because the latter is also a function (and naming a column after a function is unwise).
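To see how the grouping variable comes about, here is a minimal sketch that traces the same steps on a made-up sensor vector (toy data, not the real dataset). It uses base-R stand-ins for lead and lag so that it runs without any packages. Keep in mind that filter drops rows where the condition is NA, which is why the last row (whose lead is NA) disappears.

```r
# Toy data (an assumption for illustration): one stray reading (7)
# sits between a 16 and the next 1.
sensor <- c(14, 15, 16, 7, 1, 2, 16, 1, 2)

# Stand-in for lead(sensor) - sensor: difference to the NEXT reading.
differ <- c(sensor[-1], NA) - sensor

# Drop rows where the sequence goes down but the sensor is not 16.
# dplyr::filter() also drops rows whose condition is NA, so mimic that.
keep <- !(differ < 0 & sensor != 16)
keep[is.na(keep)] <- FALSE
sensor_kept <- sensor[keep]
differ_kept <- differ[keep]

# Stand-in for lag(differ, default = 0): shift down by one, pad with 0.
lagged <- c(0, differ_kept[-length(differ_kept)])

# A negative lagged difference marks the first row after a 16,
# so the running count of those rows numbers the sequences.
grp <- cumsum(lagged < 0) + 1

print(data.frame(sensor = sensor_kept, differ = differ_kept, grp = grp))
# grp: 1 1 1 2 2 2 3
```

The stray 7 is removed, and each sequence ending in 16 gets its own group number, which is what group_by(sensor, grp) then relies on.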

You can also use the data.table package:

library(data.table)
setDT(df)[, differ := shift(sensor, type='lead') - sensor
          ][!(differ < 0 & sensor != 16)
            ][, grp := cumsum(shift(differ,fill=0) < 0) + 1
              ][, .(mean_val = mean(value)), .(sensor,grp)]

where setDT(df) converts your data frame to a data.table.
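As a small illustration of that conversion, the sketch below runs setDT and the two shift calls on a tiny made-up data frame (the toy object and its values are assumptions, not part of the real dataset). setDT changes the object in place rather than returning a copy, and shift with type = "lead" plays the role of dplyr's lead, while the default shift behaves like lag.

```r
library(data.table)

# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(sensor = c(15, 16, 1, 2),
                  value  = c(0.1, 0.2, 0.3, 0.4))

setDT(toy)               # toy itself is now a data.table
print(is.data.table(toy))

# Difference to the next reading; NA for the last row
toy[, differ := shift(sensor, type = "lead") - sensor]

# Default shift is a lag; fill = 0 pads the first row
toy[, grp := cumsum(shift(differ, fill = 0) < 0) + 1]

print(toy)
```

Because := adds columns by reference, no assignment back to toy is needed after either step.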
