I am working with a MultiIndex Series that contains mixed-type values (timedeltas and ints):
char
7     a    103 minutes
      s             63
9     a    129 minutes
      s            211
10    a    106 minutes
      s             63
Name: timestamp, dtype: object
Index:
MultiIndex(levels=[[7, 9, 10], ['a', 's']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['char', None])
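For anyone who wants to reproduce this, a Series of the same shape can be rebuilt by hand. This is only a sketch: the values are copied from the printed output above, and the variable names `index`, `values` and `timetable` are chosen here for illustration.

```python
import pandas as pd

# Index and values copied by hand from the printed Series above
index = pd.MultiIndex.from_product([[7, 9, 10], ['a', 's']],
                                   names=['char', None])
values = [
    pd.Timedelta(minutes=103), 63,
    pd.Timedelta(minutes=129), 211,
    pd.Timedelta(minutes=106), 63,
]

# dtype=object keeps the Timedelta and int values side by side
timetable = pd.Series(values, index=index, name='timestamp', dtype=object)

# Show which Python type is stored at each position
print(timetable.map(type))
```

Printing the stored types makes it easy to confirm that the 'a' rows hold timedeltas and the 's' rows hold plain integers before any reshaping is attempted.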
When I try to unstack it with pandas.Series.unstack(), all values are converted to timedeltas (displayed with different precision):
            a               s
char
7    01:43:00 00:00:00.000000
9    02:09:00 00:00:00.000000
10   01:46:00 00:00:00.000000
Does anyone know where this comes from?
Edit
Here is some more information. A sample of the raw data:
timestamp char
0 2008-01-15 23:56:52 7
1 2008-01-16 00:07:28 7
2 2008-01-01 16:12:32 9
3 2008-01-03 01:52:08 9
4 2008-07-06 17:23:25 10
5 2008-07-06 17:33:47 10
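The sample can be turned into a DataFrame for experimenting. This is a minimal sketch that assumes the frame is called `data` (the name used in the code further down) and that the timestamp column is parsed as datetimes.

```python
import pandas as pd

# The six sample rows shown above, typed in by hand
data = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2008-01-15 23:56:52',
        '2008-01-16 00:07:28',
        '2008-01-01 16:12:32',
        '2008-01-03 01:52:08',
        '2008-07-06 17:23:25',
        '2008-07-06 17:33:47',
    ]),
    'char': [7, 7, 9, 9, 10, 10],
})

print(data.dtypes)
```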
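I extracted some features:
import numpy as np
import pandas as pd

def get_session(ts):
    ts = ts.sort_index()
    # Gap to the previous log entry; the first entry has none, so use a zero gap
    dt = (ts - ts.shift()).fillna(pd.Timedelta(0))
    # A gap longer than 30 minutes marks the start of a new session
    first_logs = dt > '30m'
    sessions = first_logs.cumsum() + 1
    duration = sessions.value_counts().mean() * np.timedelta64(10, 'm')
    return pd.Series({'s': max(sessions), 'a': duration})

timetable = data.groupby('char')['timestamp'].apply(get_session)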
This gives me:
char
7     a    20 minutes
      s             1
9     a    10 minutes
      s             2
10    a    20 minutes
      s             1
Name: timestamp, dtype: object
After unstacking, it looks like this:
timetable.unstack()
            a               s
char
7    00:20:00 00:00:00.000000
9    00:10:00 00:00:00.000000
10   00:20:00 00:00:00.000000
Answer 0 (score: 3)
This looks like a bug.
I think you can return a DataFrame from the function instead, so that unstack is not necessary:
def get_session(ts):
    ts = ts.sort_index()
    # Gap to the previous log entry; the first entry has none, so use a zero gap
    dt = (ts - ts.shift()).fillna(pd.Timedelta(0))
    first_logs = dt > '30m'
    sessions = first_logs.cumsum() + 1
    duration = sessions.value_counts().mean() * np.timedelta64(10, 'm')
    # One-row DataFrame: each feature gets its own column and its own dtype
    return pd.DataFrame({'s': max(sessions), 'a': duration}, index=[0])

timetable = data.groupby('char')['timestamp'].apply(get_session)
print(timetable)
               a  s
char
7    0  00:20:00  1
9    0  00:10:00  2
10   0  00:20:00  1
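However, there is a problem with the index (the second level is all 0), so you can create the index from column s and then set the index name with rename_axis (new in pandas 0.18.0):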
def get_session(ts):
    ts = ts.sort_index()
    # Gap to the previous log entry; the first entry has none, so use a zero gap
    dt = (ts - ts.shift()).fillna(pd.Timedelta(0))
    first_logs = dt > '30m'
    sessions = first_logs.cumsum() + 1
    duration = sessions.value_counts().mean() * np.timedelta64(10, 'm')
    # Use the session count as the index value and name that index level 's'
    return pd.DataFrame({'a': duration}, index=[max(sessions)]).rename_axis('s')

timetable = data.groupby('char')['timestamp'].apply(get_session)
print(timetable)
               a
char s
7    1  00:20:00
9    2  00:10:00
10   1  00:20:00
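If s should stay a regular column rather than an index level, another option is to skip apply and collect one dict per group, then build the DataFrame in one go. This is only a sketch and not part of the approach above: the helper name `session_features` is made up here, the sample rows are retyped from the question, and the duration is computed with pd.Timedelta instead of np.timedelta64.

```python
import pandas as pd

# Sample rows retyped from the question
data = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2008-01-15 23:56:52', '2008-01-16 00:07:28',
        '2008-01-01 16:12:32', '2008-01-03 01:52:08',
        '2008-07-06 17:23:25', '2008-07-06 17:33:47',
    ]),
    'char': [7, 7, 9, 9, 10, 10],
})

def session_features(ts):
    """Hypothetical helper: return the two features as a plain dict."""
    ts = ts.sort_index()
    # Gap to the previous log entry; the first entry gets a zero gap
    dt = (ts - ts.shift()).fillna(pd.Timedelta(0))
    # A gap longer than 30 minutes starts a new session
    first_logs = dt > pd.Timedelta(minutes=30)
    sessions = first_logs.cumsum() + 1
    mean_logs = float(sessions.value_counts().mean())
    duration = pd.Timedelta(minutes=10) * mean_logs
    return {'s': int(sessions.max()), 'a': duration}

# Collect one dict per group, remembering the group key for the index
keys, rows = [], []
for char, ts in data.groupby('char')['timestamp']:
    keys.append(char)
    rows.append(session_features(ts))

# Each column is built from homogeneous values, so no object-dtype Series
# with mixed ints and timedeltas is ever created
timetable = pd.DataFrame(rows, index=pd.Index(keys, name='char'))
print(timetable.dtypes)
```

Because the ints and timedeltas never share a single Series, there is nothing to unstack and each column should keep its own dtype.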