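I have a data frame, and I want to remove any week that contains an outlier. I would be happy if I could simply flag the whole week as an outlier, because I know how to subset from there. I have not been able to come up with a suitable solution. I keep thinking that I either need to loop over weekly subsets, or write a separate function that handles a single outlier week and call it with sapply. I have not gotten either approach to work.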
date <- seq(as.Date("2015-01-01"), length.out = 365, by = "1 day")
# Build the data frame directly; cbind() would strip the Date class and force a re-conversion
df <- data.frame(date = date)
df$dow <- as.factor(weekdays(df$date))
set.seed(1115)
df$var1 <- rnorm(365, 1912, 40795)
stdev <- sd(df$var1, na.rm=TRUE)
avg <- mean(df$var1, na.rm=TRUE)
df$LB <- avg-(2.75*stdev)
df$UB <- avg+(2.75*stdev)
df$outlier <- ifelse(df$var1<df$LB | df$var1>df$UB, 1,0)
df$weeknum <- as.numeric(format(df$date, "%U"))
head(df, 17)
> head(df, 17)
         date       dow       var1        LB       UB outlier weeknum
1  2015-01-01  Thursday  -7828.412 -114675.6 120479.8       0       0
2  2015-01-02    Friday  25674.456 -114675.6 120479.8       0       0
3  2015-01-03  Saturday -33588.871 -114675.6 120479.8       0       0
4  2015-01-04    Sunday -54418.175 -114675.6 120479.8       0       1
5  2015-01-05    Monday -10002.002 -114675.6 120479.8       0       1
6  2015-01-06   Tuesday  34050.390 -114675.6 120479.8       0       1
7  2015-01-07 Wednesday -37584.648 -114675.6 120479.8       0       1
8  2015-01-08  Thursday  84048.878 -114675.6 120479.8       0       1
9  2015-01-09    Friday -24801.346 -114675.6 120479.8       0       1
10 2015-01-10  Saturday  33974.637 -114675.6 120479.8       0       1
11 2015-01-11    Sunday  77432.088 -114675.6 120479.8       0       2
12 2015-01-12    Monday 128196.236 -114675.6 120479.8       1       2
13 2015-01-13   Tuesday   9740.418 -114675.6 120479.8       0       2
14 2015-01-14 Wednesday  26539.887 -114675.6 120479.8       0       2
15 2015-01-15  Thursday  12172.834 -114675.6 120479.8       0       2
16 2015-01-16    Friday   1032.544 -114675.6 120479.8       0       2
17 2015-01-17  Saturday  76870.095 -114675.6 120479.8       0       2
In the example above, the desired output is a 1 in the outlier column for every row where weeknum = 2.
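Answer 0: (score: 0)
You said "the desired output is a 1 in the outlier column for every row where weeknum = 2." So do you really need an outlier column at all? It seems you could simply subset the data.frame on the value of the weeknum column, like this: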
df <- df[!(df$weeknum==2),]
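The line above hardcodes week 2. If the affected weeks are not known in advance, the same subsetting idea can be driven by the outlier column itself. The following is only a minimal sketch: `toy` is a small made-up data frame standing in for the question's `df`, and `bad_weeks` and `toy_clean` are hypothetical names introduced for illustration.

```r
# Hypothetical stand-in for df: two columns are enough to show the idea
toy <- data.frame(
  weeknum = c(0, 0, 1, 1, 2, 2, 2),
  outlier = c(0, 0, 0, 0, 0, 1, 0)
)

# Week numbers that contain at least one row flagged as an outlier
bad_weeks <- unique(toy$weeknum[toy$outlier == 1])

# Keep only the rows whose week is not among those weeks
toy_clean <- toy[!(toy$weeknum %in% bad_weeks), ]
```

Here only week 2 contains a flagged row, so all three week-2 rows are dropped and the four rows from weeks 0 and 1 remain.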
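Answer 1: (score: 0)
The answer involves testing two vectors against each other. Once I realized that, I was able to refine my search and find a suitable answer here.
The code needed to flag every element correctly is: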
out.df <- df[which(df$outlier == 1), ]  # subset containing only the outlier rows
# Flag every row whose weeknum appears among the outlier rows' weeknum values:
# %in% is TRUE for a match, so those rows get 1 and all other rows get 0.
df$outlier <- ifelse(df$weeknum %in% out.df$weeknum, 1, 0)
This gives the result:
> head(df, 17)
         date       dow       var1        LB       UB outlier weeknum
1  2015-01-01  Thursday  -7828.412 -114675.6 120479.8       0       0
2  2015-01-02    Friday  25674.456 -114675.6 120479.8       0       0
3  2015-01-03  Saturday -33588.871 -114675.6 120479.8       0       0
4  2015-01-04    Sunday -54418.175 -114675.6 120479.8       0       1
5  2015-01-05    Monday -10002.002 -114675.6 120479.8       0       1
6  2015-01-06   Tuesday  34050.390 -114675.6 120479.8       0       1
7  2015-01-07 Wednesday -37584.648 -114675.6 120479.8       0       1
8  2015-01-08  Thursday  84048.878 -114675.6 120479.8       0       1
9  2015-01-09    Friday -24801.346 -114675.6 120479.8       0       1
10 2015-01-10  Saturday  33974.637 -114675.6 120479.8       0       1
11 2015-01-11    Sunday  77432.088 -114675.6 120479.8       1       2
12 2015-01-12    Monday 128196.236 -114675.6 120479.8       1       2
13 2015-01-13   Tuesday   9740.418 -114675.6 120479.8       1       2
14 2015-01-14 Wednesday  26539.887 -114675.6 120479.8       1       2
15 2015-01-15  Thursday  12172.834 -114675.6 120479.8       1       2
16 2015-01-16    Friday   1032.544 -114675.6 120479.8       1       2
17 2015-01-17  Saturday  76870.095 -114675.6 120479.8       1       2
This is exactly the result I wanted.
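One caveat: "%U" returns the week of the year, so the same week number comes back every year. With the single year of data used here that is harmless, but on data spanning several years an outlier in week 2 of one year would also flag week 2 of every other year. A possible variant, sketched below under assumptions, groups on a year-plus-week key and uses base R's ave() to spread the per-group maximum of the flag across each group. The data frame `toy` and the columns `yearweek` and `outlier_week` are made up for illustration and are not part of the original code.

```r
# Hypothetical data: the 2nd and 4th dates share week number 02 but fall in different years
toy <- data.frame(
  date    = as.Date(c("2015-01-11", "2015-01-12", "2015-01-18", "2016-01-11")),
  outlier = c(0, 1, 0, 0)
)

# Year-qualified week key, e.g. "2015-02", so weeks from different years stay separate
toy$yearweek <- format(toy$date, "%Y-%U")

# ave() applies FUN within each group and returns a vector as long as the input,
# so every row of a week receives that week's maximum outlier flag (1 if any row is 1)
toy$outlier_week <- ave(toy$outlier, toy$yearweek, FUN = max)

# Drop every week that contains at least one outlier
toy_clean <- toy[toy$outlier_week == 0, ]
```

In this sketch both January 2015 rows in week 02 are flagged, while the January 2016 row with the same week number is left alone.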