This is a follow-up to my question here. I managed to do what I wanted, but I did it without using an indexed column. For performance reasons (and other functionality) I'd like to convert my datetime column into an index, but I get stuck when trying to do so. Here's the relevant code:
dfs = [pd.read_csv(f, names=CSV_COLUMNS, parse_dates={'timestamp': [1, 2]}, date_parser=parse) for f in files]
df = pd.concat(dfs)
df = df[(df['timestamp'] >= start)]
df['day'] = df.timestamp.apply(lambda x: x.isoweekday())
df = df[df['day'].isin(self.days)]
df.sort_values(['mac', 'timestamp'], inplace=True)  # df.sort(columns=...) was removed in pandas 0.20
df['departure time'] = df['timestamp'].shift(1)
# mask out items outside our hour bounds
df['hour'] = df.timestamp.apply(lambda x: x.hour)
df['departure hour'] = df['hour'].shift(1)
# get rid of items outside our time range or with NaN values in either hour column
df = df[(df['departure hour'] >= self.time_range[0].hour) & (df['hour'] <= self.time_range[1].hour)]
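A side note on that last masking line: NaN compares False under both `>=` and `<=`, so the rows whose shifted departure hour is NaN drop out automatically, with no separate dropna step needed. A minimal sketch with made-up hours and a 15–16 bound:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hour': [16, 16, 17],
                   'departure hour': [np.nan, 15, 16]})

# NaN compares False under >= and <=, so the shifted-in NaN row is
# dropped along with anything outside the 15-16 hour bound.
mask = (df['departure hour'] >= 15) & (df['hour'] <= 16)
kept = df[mask]
```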
When I move the timestamp column into the index, I can't figure out how to handle a few things: when I attempt the departure hour conversion, I get "cannot convert without offset"; and for the day column the index only gives me .weekday, so I have to fake isoweekday by adjusting by 1. What I'd really like is the convenience of the timestamp column I have now, combined with the performance of a DatetimeIndex (and the ability to do frequency analysis). Alternatively, if I'm going about this completely backwards, that's fine too.
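One possible way to get both at once (not from the original thread, just a hedged sketch): DataFrame.set_index accepts drop=False, which keeps timestamp as a regular column while also using it as a DatetimeIndex, so both access styles keep working:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2013-09-13 16:00:13',
                                 '2013-09-13 16:00:55']),
    'node': ['node32', 'node33'],
})

# drop=False retains the 'timestamp' column alongside the new
# DatetimeIndex built from the same values.
df2 = df.set_index('timestamp', drop=False)
```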
Edit: here's a (wrapped) view of the dataframe after running the code above:
timestamp node mac day origin \
0 2013-09-13 16:00:13.494737 node32 00:05:4F:90:0D:C5 5 node31
0 2013-09-13 16:00:55.084211 node33 00:05:4F:90:0D:C5 5 node32
0 2013-09-13 16:01:37.810526 node34 00:05:4F:90:0D:C5 5 node33
0 2013-09-13 16:02:40.336842 node35 00:05:4F:90:0D:C5 5 node34
0 2013-09-13 16:03:50.347368 node36 00:05:4F:90:0D:C5 5 node35
departure time hour departure hour node_key time_elapsed
0 2013-09-13 15:59:59 16 15 node31node32 14.494737
0 2013-09-13 16:00:13.494737 16 16 node32node33 41.589474
0 2013-09-13 16:00:55.084211 16 16 node33node34 42.726315
0 2013-09-13 16:01:37.810526 16 16 node34node35 62.526316
0 2013-09-13 16:02:40.336842 16 16 node35node36 70.010526
Answer (score: 1)
FYI, posting wrapped data is really hard to parse; it's better to just paste it as-is and let the scrollbar do the work. Then I can copy/paste it directly.
In [56]: df
Out[56]:
timestamp node mac day origin departure time hour depart_hour node_key time_elapsed
0 2013-09-13 16:00:13.494737 node32 00:05:4F:90:0D:C5 5 node31 2013-09-13 15:59:59 16 15 node31node32 14.494737
1 2013-09-13 16:00:55.084211 node33 00:05:4F:90:0D:C5 5 node32 2013-09-13 16:00:13.494737 16 16 node32node33 41.589474
2 2013-09-13 16:01:37.810526 node34 00:05:4F:90:0D:C5 5 node33 2013-09-13 16:00:55.084211 16 16 node33node34 42.726315
3 2013-09-13 16:02:40.336842 node35 00:05:4F:90:0D:C5 5 node34 2013-09-13 16:01:37.810526 16 16 node34node35 62.526316
4 2013-09-13 16:03:50.347368 node36 00:05:4F:90:0D:C5 5 node35 2013-09-13 16:02:40.336842 16 16 node35node36 70.010526
[5 rows x 10 columns]
In [57]: df.dtypes
Out[57]:
timestamp datetime64[ns]
node object
mac object
day int64
origin object
departure time datetime64[ns]
hour int64
depart_hour int64
node_key object
time_elapsed float64
dtype: object
I think your date columns already have the correct dtypes, since it looks like you're parsing them at read time.
The difference between what I have and what you show is the index. When you concat, do concat(list_of_frames, ignore_index=True) in this case, because I bet the indices of the frames read into dfs each start at 0, and you want a unique, contiguous index.
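The ignore_index behaviour recommended above can be seen on a toy pair of frames (column names made up for illustration):

```python
import pandas as pd

# Two frames whose indices both start at 0, as happens when each
# CSV file is read separately with pd.read_csv.
a = pd.DataFrame({'mac': ['aa', 'bb'], 'value': [1, 2]})
b = pd.DataFrame({'mac': ['cc', 'dd'], 'value': [3, 4]})

# Plain concat keeps the duplicate labels: 0, 1, 0, 1.
dup = pd.concat([a, b])

# ignore_index=True rebuilds a unique, contiguous RangeIndex instead.
df = pd.concat([a, b], ignore_index=True)
```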
You can do various operations in a vectorized way by wrapping the Series in an Index and then performing Index operations.
In [58]: pd.Index(df['timestamp']).weekday
Out[58]: array([4, 4, 4, 4, 4])
In [59]: pd.Index(df['timestamp']).hour
Out[59]: array([16, 16, 16, 16, 16])
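The Index-wrapping trick can be reproduced with a few timestamps modelled on the sample data; modern pandas also exposes the same fields through the Series .dt accessor:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    '2013-09-13 16:00:13',
    '2013-09-13 16:00:55',
    '2013-09-13 16:01:37',
]))

# Wrapping the Series in a DatetimeIndex exposes vectorized datetime
# fields; 2013-09-13 is a Friday, and .weekday counts Monday as 0.
weekdays = pd.Index(ts).weekday
hours = pd.Index(ts).hour

# Equivalent in modern pandas, without constructing an Index.
weekdays_dt = ts.dt.weekday
```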
Setting the index and shifting:
In [65]: df2 = df.set_index('timestamp')
In [69]: df2.shift(1)
Out[69]:
node mac day origin departure time hour depart_hour node_key time_elapsed
timestamp
2013-09-13 16:00:13.494737 NaN NaN NaN NaN NaT NaN NaN NaN NaN
2013-09-13 16:00:55.084211 node32 00:05:4F:90:0D:C5 5 node31 2013-09-13 15:59:59 16 15 node31node32 14.494737
2013-09-13 16:01:37.810526 node33 00:05:4F:90:0D:C5 5 node32 2013-09-13 16:00:13.494737 16 16 node32node33 41.589474
2013-09-13 16:02:40.336842 node34 00:05:4F:90:0D:C5 5 node33 2013-09-13 16:00:55.084211 16 16 node33node34 42.726315
2013-09-13 16:03:50.347368 node35 00:05:4F:90:0D:C5 5 node34 2013-09-13 16:01:37.810526 16 16 node34node35 62.526316
[5 rows x 9 columns]
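A minimal sketch of the same set_index/shift behaviour, on toy data modelled on the sample: shift(1) moves the column values down one row while the DatetimeIndex stays put, so the first row becomes NaN.

```python
import pandas as pd

idx = pd.DatetimeIndex(pd.to_datetime([
    '2013-09-13 16:00:13',
    '2013-09-13 16:00:55',
    '2013-09-13 16:01:37',
]), name='timestamp')
df2 = pd.DataFrame({'node': ['node32', 'node33', 'node34']}, index=idx)

# Values shift down one row; the index labels are unchanged, so the
# first row holds NaN and the last original value drops off the end.
shifted = df2.shift(1)
```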
I'm not entirely clear on what your question is, so maybe edit your post.