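I have a dataset in which sensors (1-16) take measurements a variable number of times, and then the sequence repeats. I want the mean of value for each sensor within each sequence.
Not every sequence runs cleanly from 16 back to 1 (stray measurements sometimes have to be removed). Note: this is a small fake dataset.
dataset (it can also be read with the script below)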
# To read with rio
# library("devtools")
# install_github("leeper/rio")
library("rio")
df <- import("https://gist.githubusercontent.com/karthik/ad2874e5b5c5f3af73ad89d14b26a913/raw/f435317539bc56a09b248a0ef193db21b7176eee/small.csv")
My first attempt:
library(dplyr)
# Assigning groups to the data
df$diff <- c(df$sensor[2:nrow(df)], 0) - df$sensor
# There is sometimes a sensor reading between 16 and 1. This removes those rows.
# (a logical index is used because df[-which(...), ] drops every row when nothing matches)
df2 <- df[!(df$diff < 0 & df$sensor != 16), ]
# end is now where the last 16 was
end <- which(df2$diff < 0)
# Start begins with 1, then adds 1 to the position of every last 16 sensor
# reading to get the next 1
start <- c(1, head(end, -1) + 1)
# Now combine both into a data.frame
positions <- data.frame(start, end)
# Add unique groups
positions$group <- 1:nrow(positions)
df2$group <- NA
# Yes this is a horrible loop and
# super inefficient on the full dataset.
for (i in 1:nrow(positions)) {
  df2[positions[i, ]$start:positions[i, ]$end, ]$group <-
    positions[i, ]$group
}
Now dplyr can be used easily:
df3 <- df2 %>%
  group_by(sensor, group) %>%
  summarise(mean_value = mean(value))
head(df3)
This gives me what I want:
Source: local data frame [6 x 3]
Groups: sensor [4]
sensor group mean_value
(int) (int) (dbl)
1 1 2 0.07285933
2 2 2 0.06993007
3 3 1 0.04845651
4 3 2 0.03976837
5 4 1 0.06033732
6 4 2 0.06480888
Is there a better way to do this?
Answer 0 (score: 3)
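Instead of creating the positions data frame, creating the intermediate data frame df2, and using a for loop to add the grouping variable, you can do everything with the dplyr vocabulary. By combining cumsum and lag, you can add the grouping variable with mutate: a negative differ marks the last reading of a sequence, so the running count of negative values in the previous row increases by one exactly where a new sequence starts. This leads to a much simpler procedure: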
df %>%
  mutate(differ = lead(sensor) - sensor) %>%
  filter(!(differ < 0 & sensor != 16)) %>%
  mutate(grp = cumsum(lag(differ, default = 0) < 0) + 1) %>%
  group_by(sensor, grp) %>%
  summarise(mean_val = mean(value))
which gives:
Source: local data frame [30 x 3]
Groups: sensor [?]
sensor grp mean_val
(int) (dbl) (dbl)
1 1 2 0.07285933
2 2 2 0.06993007
3 3 1 0.04845651
4 3 2 0.03976837
5 4 1 0.06033732
6 4 2 0.06480888
7 5 1 0.03276722
8 5 2 0.05005240
9 6 1 0.06967405
10 6 2 0.06484712
.. ... ... ...
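To make the cumsum/lag step more concrete, here is a minimal sketch in base R on a made-up sensor vector (the values are hypothetical and not taken from the dataset). It assumes the stray readings have already been filtered out, and it uses plain vector indexing as stand-ins for lead() and lag():

```r
# Hypothetical sensor readings: three sequences, each ending where the
# next reading is lower than the current one.
sensor <- c(3, 4, 16, 1, 2, 16, 1)

# Equivalent of lead(sensor) - sensor: next reading minus current reading.
differ <- c(sensor[-1], NA) - sensor

# Equivalent of lag(differ, default = 0): the previous row's difference.
prev_differ <- c(0, differ[-length(differ)])

# A negative previous difference means the previous row closed a sequence,
# so the running count goes up by one on the first row of each new sequence.
grp <- cumsum(prev_differ < 0) + 1

print(data.frame(sensor, differ, grp))
```

Here grp comes out as 1, 1, 1, 2, 2, 2, 3: each reading that follows a drop in sensor number starts a new group.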
Note: I used differ as the variable name instead of diff, because the latter is also a function (and giving a column the name of a function is unwise).
You can also use the data.table package:
library(data.table)
setDT(df)[, differ := shift(sensor, type='lead') - sensor
][!(differ < 0 & sensor != 16)
][, grp := cumsum(shift(differ,fill=0) < 0) + 1
][, .(mean_val = mean(value)), .(sensor,grp)]
where setDT(df) converts your data frame to a data table by reference, without making a copy.
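As a small illustration of what setDT does, the sketch below converts a made-up data frame (toy is a hypothetical name, not part of the question's data) and checks its class afterwards. No assignment is needed because the conversion happens in place:

```r
library(data.table)

# Hypothetical two-row data frame standing in for the real dataset.
toy <- data.frame(sensor = c(1L, 2L), value = c(0.1, 0.2))

# setDT() changes the object itself; the result does not need to be reassigned.
setDT(toy)

print(is.data.table(toy))
print(class(toy))
```

One consequence is that after running the data.table chain on df, df itself is a data table and carries the differ column added by :=.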