Question

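If I have an hourly dataset with 3 variables (time, a, b) and want to look at the standard deviation of "b" on the specific days where "a" has an outlier, how should I go about it? The idea is: if the value of variable "a" is above some threshold, e.g. 99 as in the example below, what is the standard deviation of variable "b" over that whole day, and what is the sd of "b" for the day before and the day after? I'll try to clarify the question with an example: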

set.seed(1)
df = data.frame(
  "time" = seq(
    from = as.POSIXct("2016-05-01 00:00", tz = "Europe/Berlin"),
    to = as.POSIXct("2016-05-04 23:00", tz = "Europe/Berlin"),
    by = "hour"),
  "a" = runif(96, min = 0, max = 100),
  "b" = runif(96, min = 1200, max = 30000))

If this is the data, I would like to write a command along these lines:

test = data.frame("time" = df$time, "extreme" = ifelse(df$a> 99, sd(#take the sd of "b" for the day where df$a>99 occured) & sd(#and for the day before and after), 0 ))

test = subset(test, test$extreme>0) # to have a data frame with the important values only

I appreciate any help.

Answer 1

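If you want to find the day on which a is above that threshold, and then compute the standard deviation of b over the day before, that day, and the day after: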

library(lubridate)  # provides day()

# Day of month on which a exceeds the threshold
threshold_day <- day(df[df$a > 99, ]$time)
# That day plus the day before and the day after
threshold_days <- c(threshold_day - 1, threshold_day, threshold_day + 1)
outlier_days <- df[day(df$time) %in% threshold_days, ]
# One sd over all selected rows, repeated in every row
outlier_days$sd_b <- sd(outlier_days$b)
head(outlier_days)
#                  time        a        b     sd_b
# 1 2016-05-01 00:00:00 26.55087 14311.90 7730.978
# 2 2016-05-01 01:00:00 37.21239 13010.42 7730.978
# 3 2016-05-01 02:00:00 57.28534 24553.06 7730.978
# 4 2016-05-01 03:00:00 90.82078 18622.08 7730.978
# 5 2016-05-01 04:00:00 20.16819 20056.05 7730.978
# 6 2016-05-01 05:00:00 89.83897 11372.08 7730.978

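Note that this only covers that day and the day after (because there is no data for the day before), and that a column holding the standard deviation is generally not very useful (since it is a single value). But I think that is what you asked for... please clarify if you meant something else.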

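If you want the standard deviations separately, grouped by day, just split by day and apply sd. Again, you only get two days (two groups), because your threshold is exceeded on the first day for which you have data. So you can't include the day before (there is no data for April).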

tapply(outlier_days$b, day(outlier_days$time), sd)

If you really want it grouped but still inside the data frame, you could merge it back in, but you are probably better off using dplyr:

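library(dplyr)      # filter(), group_by(), mutate(), %>%
library(lubridate)  # day()

threshold_day <- day(filter(df, a > 99)$time)
threshold_days <- c(threshold_day - 1, threshold_day, threshold_day + 1)
filter(df, day(time) %in% threshold_days) %>%
    group_by(day(time)) %>%
    mutate(sd_b = sd(b))  # per-day sd of b, repeated within each day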

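Of course, if you post another reprex with different data, say one with dates in other months, this will fail unless it is adapted to the expected input. That is why it matters to test against the range of inputs you expect. For example, for data spanning more than one month you would want to group by the full date, not just the day of the month. (Swap day() for date() and you will get results that work for that data.)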
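To make the month-boundary point concrete, here is a minimal sketch that selects and groups by full calendar date. It uses base R (`as.Date`, `tapply`) rather than the lubridate/dplyr calls above, and the data frame `df2`, its month-spanning time range, and the hand-placed outlier are hypothetical, not data from the question.

```r
# Hypothetical data spanning a month boundary (not the data from the question).
set.seed(1)
time <- seq(
  from = as.POSIXct("2016-04-29 00:00", tz = "Europe/Berlin"),
  to   = as.POSIXct("2016-05-02 23:00", tz = "Europe/Berlin"),
  by   = "hour"
)
df2 <- data.frame(time = time, a = 50,
                  b = runif(length(time), min = 1200, max = 30000))
# Place a single outlier by hand on 1 May, so the "day before" is 30 April.
df2$a[df2$time == as.POSIXct("2016-05-01 12:00", tz = "Europe/Berlin")] <- 99.5

# Full calendar date of every row, taken in the data's own time zone.
df2$date <- as.Date(df2$time, tz = "Europe/Berlin")

# Dates with at least one a > 99, plus the day before and the day after.
hit_dates <- unique(df2$date[df2$a > 99])
window_dates <- unique(c(hit_dates - 1, hit_dates, hit_dates + 1))

# Daily sd of b, restricted to those dates.
sel <- df2$date %in% window_dates
sd_by_date <- tapply(df2$b[sel], format(df2$date[sel]), sd)
sd_by_date
```

With day-of-month arithmetic, the day before 1 May would come out as day 0 and match nothing; subtracting 1 from a `Date` gives 30 April as intended.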

Answer 2

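As already pointed out in the comments, you have only 1 case where a > 99. Consequently, the result is NA. Nevertheless, here is the code that gives you this value: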

library(tidyverse)
df %>% filter(a > 99) %>% mutate(sd_b = sd(b))

Result:

             time        a        b      sd_b
1 2016-05-01 17:00:00 99.19061 13626.44  NaN

Note that if you have a larger dataset in which b may contain NAs, you have to take that into account.
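One common way to take missing values into account is the `na.rm` argument of `sd()`. A small sketch with made-up values (the vector `b_with_na` is hypothetical):

```r
# Hypothetical values of b with one missing observation.
b_with_na <- c(13626.44, NA, 20056.05, 11372.08)

sd_default <- sd(b_with_na)                # NA, because one value is missing
sd_no_na   <- sd(b_with_na, na.rm = TRUE)  # sd of the three non-missing values
```

In the pipeline above, that would presumably become `mutate(sd_b = sd(b, na.rm = TRUE))`. Whether silently dropping the NAs is appropriate depends on why they are missing.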

Answer 3

Thanks for your help, @Dan Hall. I used a few commands to get to the right answer:

library(dplyr)      # group_by(), mutate(), %>%
library(lubridate)  # date()
library(Hmisc)      # Lag(x, shift = ...); Lag() is assumed to come from Hmisc

# Add an additional variable with the daily sd of "b"
df_augmented = df %>%
  group_by(date(time)) %>%
  mutate(sd_price = sd(b))

# For rows where a > 99, keep the sd of that day, the day before and the day after.
# With hourly data, a shift of 24 rows corresponds to one day.
sd.extreme = data.frame(
  "time" = df_augmented$time,
  "date" = date(df_augmented$time),
  "sd_b_lagday" = ifelse(df_augmented$a > 99,
                         Lag(df_augmented$sd_price, shift = 24), 0),
  "sd_b_day" = ifelse(df_augmented$a > 99,
                      df_augmented$sd_price, 0),
  "sd_b_leadday" = ifelse(df_augmented$a > 99,
                          Lag(df_augmented$sd_price, shift = -24), 0)
)

# Keep only the rows with an extreme value
sd.extreme = subset(sd.extreme, sd.extreme$sd_b_day > 0)

# One row per date
sd.extreme = sd.extreme[!duplicated(sd.extreme$date), ]

# Drop the time column
sd.extreme = sd.extreme[, -1]
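The same idea can also be sketched on a one-row-per-day table, so that lag and lead become a shift of one row instead of 24. This is an alternative illustration in base R, not the code above; it assumes the days in the data are consecutive with no gaps, and the names `daily` and `sd_extreme` are made up for the example.

```r
# Recreate the data from the question.
set.seed(1)
df <- data.frame(
  time = seq(from = as.POSIXct("2016-05-01 00:00", tz = "Europe/Berlin"),
             to   = as.POSIXct("2016-05-04 23:00", tz = "Europe/Berlin"),
             by   = "hour"),
  a = runif(96, min = 0, max = 100),
  b = runif(96, min = 1200, max = 30000)
)
df$date <- as.Date(df$time, tz = "Europe/Berlin")

# One row per day with the daily sd of b (aggregate() returns the dates sorted).
daily <- aggregate(b ~ date, data = df, FUN = sd)
names(daily)[2] <- "sd_b_day"

# Shift by one row for the previous and the next day.
# Assumption: consecutive days without gaps.
n <- nrow(daily)
daily$sd_b_lagday  <- c(NA, daily$sd_b_day[-n])
daily$sd_b_leadday <- c(daily$sd_b_day[-1], NA)

# Keep only the days on which a exceeded the threshold at least once.
extreme_dates <- unique(df$date[df$a > 99])
sd_extreme <- daily[daily$date %in% extreme_dates, ]
sd_extreme
```

Here a missing neighbouring day shows up as NA in `sd_b_lagday` or `sd_b_leadday`, which makes the gap at the edge of the data explicit.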

How to get the sd of variable "b" for a day when the value of variable "a" is above a threshold
