Why does subsetting drop rows?

Asked: 2019-02-08 19:32:51

Tags: r subset

My data frame looks like this:

            Date       Time Consumption  kVARh      kW weekday
2     2016-12-13    0:15:00      90.144  0.000 360.576     Tue
3     2016-12-13    0:30:00      90.144  0.000 360.576     Tue
4     2016-12-13    0:45:00      91.584  0.000 366.336     Tue
5     2016-12-13    1:00:00      93.888  0.000 375.552     Tue
6     2016-12-13    1:15:00      88.416  0.000 353.664     Tue
7     2016-12-13    1:30:00      88.704  0.000 354.816     Tue
8     2016-12-13    1:45:00      91.296  0.000 365.184     Tue

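I read the data from a csv in which the dates come in as factors; I converted them with as.character and then as.Date. Then I added a column that gives me the day of the week: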

sigEx1DF$weekday <- format(as.Date(sigEx1DF$Date), "%a")

Then I converted it to an ordered factor running from Sunday to Saturday.
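The conversion described above might look like the following sketch. The tiny sigEx1DF built here is a made-up stand-in for the real data, and the level labels are an assumption taken from the str() output shown further down. Note that "%a" returns locale-dependent abbreviations, so the labels assume an English locale.

```r
# Hypothetical stand-in for the real data: dates arrive as a factor from the CSV.
sigEx1DF <- data.frame(Date = factor(c("2016-12-11", "2016-12-13", "2016-12-17")))

# factor -> character -> Date
sigEx1DF$Date <- as.Date(as.character(sigEx1DF$Date))

# Abbreviated day name; the labels depend on the locale (English assumed here).
sigEx1DF$weekday <- format(sigEx1DF$Date, "%a")

# Ordered factor running from Sunday to Saturday.
day_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
sigEx1DF$weekday <- factor(sigEx1DF$weekday, levels = day_levels, ordered = TRUE)
```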

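This is fine-grained data from a smart meter that measures usage (consumption) every 15 minutes. kW = Consumption*4. I need to take the average for each weekday and then get the maximum of those averages, but when I subset the data frame, I get this: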

            Date     Time Consumption  kVARh      kW weekday
3     2016-12-13  0:30:00      90.144  0.000 360.576     Tue
8     2016-12-13  1:45:00      91.296  0.000 365.184     Tue
13    2016-12-13  3:00:00      93.600  0.000 374.400     Tue
18    2016-12-13  4:15:00      93.312  0.000 373.248     Tue
23    2016-12-13  5:30:00     107.424  0.000 429.696     Tue
28    2016-12-13  6:45:00     103.968  0.000 415.872     Tue
33    2016-12-13  8:00:00     108.576  0.000 434.304     Tue

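Several of the 15-minute intervals are now missing (for example, rows 4-7). I don't see anything different about rows 4-7, but after subsetting they are gone.

Here is the code I used to subset: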

bldg1_Wkdy <- subset(sort.df, weekday == c("Mon","Tue","Wed","Thu","Fri"), 
select = c("Date","Time","Consumption","kVARh","kW","weekday"))

Here is the structure of the data frame before subsetting:

'data.frame':   72888 obs. of  6 variables:
 $ Date       : Date, format: "2016-12-13" "2016-12-13" "2016-12-13" ...
 $ Time       : Factor w/ 108 levels "0:00:00","0:15:00",..: 2 3 4 5 6 7 8 49 50 51 ...
 $ Consumption: num  90.1 90.1 91.6 93.9 88.4 ...
 $ kVARh      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ kW         : num  361 361 366 376 354 ...
 $ weekday    : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...

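I go from 72888 observations to 10427 observations for weekdays and 10368 observations for weekends and, as shown above, many rows seem to go missing at random. Some intervals have zero consumption (the power may have gone out because of a storm or something else), but those intervals do show up in the subset data, so the zeros don't seem to be the cause of the problem. Thanks for your help!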

1 answer:

Answer 0: (score: 0)

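You should use weekday %in% c("Mon","Tue","Wed","Thu","Fri") instead of weekday == c("Mon","Tue","Wed","Thu","Fri"). The == operator compares the two vectors element by element, recycling the shorter one, whereas %in% tests each value for membership in the set. See the minimal test below, which shows how %in% works: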

> subset(x, weekday == c("Mon","Tue","Wed","Thu","Fri"))
   weekday
NA    <NA>
> subset(x, weekday %in% c("Mon","Tue","Wed","Thu","Fri"))
  weekday
1     Tue
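
To see why == drops rows, it may help to watch the recycling happen on a slightly larger example. The sketch below uses a made-up ten-row data frame x and a helper vector workdays (both are illustrations, not the asker's data) in which every row is a Tuesday.

```r
# Made-up data: ten rows, all Tuesdays (an illustration, not the original data).
x <- data.frame(id = 1:10, weekday = rep("Tue", 10))
workdays <- c("Mon", "Tue", "Wed", "Thu", "Fri")

# `==` compares element by element and recycles `workdays`:
# row 1 vs "Mon", row 2 vs "Tue", ..., row 6 vs "Mon", row 7 vs "Tue", ...
eq_rows <- subset(x, weekday == workdays)

# `%in%` asks, for every row, whether its value occurs anywhere in `workdays`.
in_rows <- subset(x, weekday %in% workdays)

print(eq_rows$id)  # only the rows that happen to line up with "Tue"
print(in_rows$id)  # every row
```

With ==, only every fifth row survives, which is the same pattern as in the question, where row names 3, 8, 13, 18, ... are the ones kept.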