Pandas: filtering and shifting on a DateTimeIndex

Time: 2014-04-07 20:03:52

Tags: python pandas time-series

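This is a follow-up to my question here. I managed to get what I wanted done, but I did it without using an index column. For performance reasons and for other functionality, I'd like to turn my datetime column into the index, but I get stuck when I try to do that. Here is the relevant code: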

dfs = [pd.read_csv(f, names=CSV_COLUMNS, parse_dates={'timestamp': [1, 2]}, date_parser=parse) for f in files]
df = pd.concat(dfs)
df = df[(df['timestamp'] >= start)]
df['day'] = df.timestamp.apply(lambda x: x.isoweekday())
df = df[df['day'].isin(self.days)]
df.sort(columns=['mac', 'timestamp'], inplace=True)
df['departure time'] = df['timestamp'].shift(1)
# mask out items outside our hour bounds
df['hour'] = df.timestamp.apply(lambda x: x.hour)
df['departure hour'] = df['hour'].shift(1)
# get rid of items outside our time range or with NaN values in either hour column
df = df[(df['departure hour'] >= self.time_range[0].hour) & (df['hour'] <= self.time_range[1].hour)]

When I put the timestamp column into the index, there are a few things I can't figure out how to handle:

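  • When I try to do the shift for departure hour, I get "Cannot shift with no offset".
  • I'm also not sure how to get the ISO weekday from the index column, though I think I can fake it by using .weekday and adjusting by 1.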

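What I'd really like is the convenience of the timestamp column that I have now, combined with the performance of a DateTimeIndex (and the ability to do frequency analysis). Alternatively, if I'm going about this completely backwards, that would be good to know too.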

Edit: after running the code above, here is a (wrapped) view of the data frame:

                   timestamp    node                mac  day  origin  \
0 2013-09-13 16:00:13.494737  node32  00:05:4F:90:0D:C5    5  node31   
0 2013-09-13 16:00:55.084211  node33  00:05:4F:90:0D:C5    5  node32   
0 2013-09-13 16:01:37.810526  node34  00:05:4F:90:0D:C5    5  node33   
0 2013-09-13 16:02:40.336842  node35  00:05:4F:90:0D:C5    5  node34   
0 2013-09-13 16:03:50.347368  node36  00:05:4F:90:0D:C5    5  node35   

              departure time  hour  departure hour      node_key  time_elapsed  
0        2013-09-13 15:59:59    16              15  node31node32     14.494737  
0 2013-09-13 16:00:13.494737    16              16  node32node33     41.589474  
0 2013-09-13 16:00:55.084211    16              16  node33node34     42.726315  
0 2013-09-13 16:01:37.810526    16              16  node34node35     62.526316  
0 2013-09-13 16:02:40.336842    16              16  node35node36     70.010526 

1 answer:

Answer 0: (score: 1)

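FYI, posting wrapped data makes it really hard to parse; it's better to just paste it and let the scroll bar do its job. Then I can copy/paste it directly.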

In [56]: df
Out[56]: 
                   timestamp    node                mac  day  origin             departure time  hour  depart_hour      node_key  time_elapsed
0 2013-09-13 16:00:13.494737  node32  00:05:4F:90:0D:C5    5  node31        2013-09-13 15:59:59    16           15  node31node32     14.494737
1 2013-09-13 16:00:55.084211  node33  00:05:4F:90:0D:C5    5  node32 2013-09-13 16:00:13.494737    16           16  node32node33     41.589474
2 2013-09-13 16:01:37.810526  node34  00:05:4F:90:0D:C5    5  node33 2013-09-13 16:00:55.084211    16           16  node33node34     42.726315
3 2013-09-13 16:02:40.336842  node35  00:05:4F:90:0D:C5    5  node34 2013-09-13 16:01:37.810526    16           16  node34node35     62.526316
4 2013-09-13 16:03:50.347368  node36  00:05:4F:90:0D:C5    5  node35 2013-09-13 16:02:40.336842    16           16  node35node36     70.010526

[5 rows x 10 columns]

In [57]: df.dtypes
Out[57]: 
timestamp         datetime64[ns]
node                      object
mac                       object
day                        int64
origin                    object
departure time    datetime64[ns]
hour                       int64
depart_hour                int64
node_key                  object
time_elapsed             float64
dtype: object

I think your date columns have the correct dtypes, since it looks like you are parsing them when you read the files in.

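The difference between what I have and what you show is the index. When you concat, use concat(list_of_frames, ignore_index=True) in this case: I bet each frame read into dfs has an index that starts from 0, and you want a unique, continuous index.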
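A minimal sketch of the index difference, using two made-up frames (the names a and b and their contents are illustrative stand-ins for the per-file read_csv results, not data from the question):

```python
import pandas as pd

# Hypothetical stand-ins for two frames returned by read_csv;
# each gets its own default index starting at 0.
a = pd.DataFrame({"mac": ["A", "A"], "node": ["node31", "node32"]})
b = pd.DataFrame({"mac": ["B", "B"], "node": ["node33", "node34"]})

# Plain concat keeps each frame's original labels, so they repeat.
stacked = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels like the all-zero index in the question's output make label-based selection ambiguous, which is why a renumbered index is usually easier to work with.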

You can do all sorts of operations in a vectorized way by wrapping the Series in an Index and then using the Index's attributes.

In [58]: pd.Index(df['timestamp']).weekday
Out[58]: array([4, 4, 4, 4, 4])

In [59]: pd.Index(df['timestamp']).hour
Out[59]: array([16, 16, 16, 16, 16])
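As a rough sketch of how this could replace the apply(lambda ...) calls in the question: pandas counts Monday as 0 while isoweekday() counts Monday as 1, so adding 1 to .weekday gives the ISO weekday. The sample timestamps are copied from the question; the .dt accessor shown at the end assumes a pandas version newer than the one used in this thread.

```python
import pandas as pd

# Two timestamps taken from the question's data (a Friday).
ts = pd.Series(pd.to_datetime([
    "2013-09-13 16:00:13.494737",
    "2013-09-13 16:00:55.084211",
]))

idx = pd.DatetimeIndex(ts)

# pandas: Monday=0 ... Sunday=6; ISO: Monday=1 ... Sunday=7.
iso_day = idx.weekday + 1
hours = idx.hour

# Assumption: on newer pandas the same fields are reachable directly
# from the Series through the .dt accessor, without wrapping in an Index.
iso_day_dt = ts.dt.weekday + 1

print(list(iso_day), list(hours), iso_day_dt.tolist())
```

The result matches the day column in the question, where these rows have day 5.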

Setting the index and shifting

In [65]: df2 = df.set_index('timestamp')

In [69]: df2.shift(1)
Out[69]: 
                              node                mac  day  origin             departure time  hour  depart_hour      node_key  time_elapsed
timestamp                                                                                                                                   
2013-09-13 16:00:13.494737     NaN                NaN  NaN     NaN                        NaT   NaN          NaN           NaN           NaN
2013-09-13 16:00:55.084211  node32  00:05:4F:90:0D:C5    5  node31        2013-09-13 15:59:59    16           15  node31node32     14.494737
2013-09-13 16:01:37.810526  node33  00:05:4F:90:0D:C5    5  node32 2013-09-13 16:00:13.494737    16           16  node32node33     41.589474
2013-09-13 16:02:40.336842  node34  00:05:4F:90:0D:C5    5  node33 2013-09-13 16:00:55.084211    16           16  node33node34     42.726315
2013-09-13 16:03:50.347368  node35  00:05:4F:90:0D:C5    5  node34 2013-09-13 16:01:37.810526    16           16  node34node35     62.526316

[5 rows x 9 columns]
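Note that df2.shift(1) moves the column values down one row and leaves the index alone. If the goal is the previous row's timestamp as a column (the question's departure time), one possible approach, sketched here under the assumption that a positional shift is what is wanted, is to turn the index into a Series and shift that. Shifting the index object itself is a different operation that needs a frequency, which is probably where the shift error in the question comes from.

```python
import pandas as pd

# Small frame built from the question's timestamps and nodes.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-09-13 16:00:13.494737",
        "2013-09-13 16:00:55.084211",
        "2013-09-13 16:01:37.810526",
    ]),
    "node": ["node32", "node33", "node34"],
})
df2 = df.set_index("timestamp")

# Treat the index values as data and shift them by one position.
prev = df2.index.to_series().shift(1)

# .values avoids label alignment, so this also works if timestamps repeat.
df2["departure time"] = prev.values
# The first row has no predecessor, so its hour is NaN (column becomes float).
df2["departure hour"] = prev.dt.hour.values

print(df2)
```

The hour bounds from the question can then be applied with an ordinary boolean mask on df2.index.hour and df2["departure hour"].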

I'm not entirely clear on what your question is, so maybe edit your post.