Does pandas.Series.unstack() affect data types?

Time: 2016-07-01 07:55:59

Tags: python pandas

I am working with a MultiIndex Series that contains mixed-type values (timedeltas and ints):

char   
7     a    103 minutes
      s             63
9     a    129 minutes
      s            211
10    a    106 minutes
      s             63
Name:  timestamp, dtype: object

The index:

MultiIndex(levels=[[7, 9, 10], ['a', 's']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['char', None])
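
For anyone who wants to poke at a Series of this shape, here is a minimal sketch that rebuilds it by hand and lists the Python type of each element. The variable names and the explicit dtype=object are assumptions made for illustration; they are not taken from the original post.

```python
import pandas as pd

# Rebuild the index shown above: outer level "char", unnamed inner level.
index = pd.MultiIndex.from_product([[7, 9, 10], ["a", "s"]], names=["char", None])

# dtype=object is passed explicitly so the timedeltas and ints are stored as-is.
values = [
    pd.Timedelta(minutes=103), 63,
    pd.Timedelta(minutes=129), 211,
    pd.Timedelta(minutes=106), 63,
]
series = pd.Series(values, index=index, dtype=object)

# Map every element to the name of its Python type.
kinds = series.map(lambda value: type(value).__name__)
print(kinds)
```

Seeing two different type names under a single object dtype is a quick confirmation that the Series really is heterogeneous before it gets reshaped.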

When I try to unstack it with pandas.Series.unstack(), it converts all the values to timedeltas (with different precision):

    a           s
char        
7   01:43:00    00:00:00.000000
9   02:09:00    00:00:00.000000
10  01:46:00    00:00:00.000000

Does anyone know where this comes from?

Edit

Here is some more information. A sample of the raw data:

    timestamp           char
0   2008-01-15 23:56:52 7
1   2008-01-16 00:07:28 7
2   2008-01-01 16:12:32 9
3   2008-01-03 01:52:08 9
4   2008-07-06 17:23:25 10
5   2008-07-06 17:33:47 10

I extract some features from it:

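import numpy as np
import pandas as pd

def get_session(ts):
    ts = ts.sort_index()
    # Gap to the previous log; the first row has none, so fill it with 0.
    dt = (ts - ts.shift()).fillna(0)
    # A gap of more than 30 minutes starts a new session.
    first_logs = dt > '30m'
    sessions = first_logs.cumsum() + 1
    # Mean number of logs per session, times 10 minutes per log.
    duration = sessions.value_counts().mean() * np.timedelta64(10, 'm')
    return pd.Series({'s': max(sessions), 'a': duration})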

timetable = data.groupby('char')[' timestamp'].apply(get_session)

This gives me:

char   
7     a    20 minutes
      s             1
9     a    10 minutes
      s             2
10    a    20 minutes
      s             1
Name:  timestamp, dtype: object
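
To see where these numbers come from, the steps inside get_session can be traced on a single group. The sketch below uses the two sample timestamps of char 9. It deliberately swaps in Series.diff() and explicit pd.Timedelta values for the `ts - ts.shift()`, `fillna(0)` and `'30m'` forms of the original function; that substitution is an assumption made so the dtypes stay unambiguous, not the author's code.

```python
import pandas as pd

# The two sample timestamps of char 9, taken from the raw data above.
ts = pd.Series(pd.to_datetime(["2008-01-01 16:12:32", "2008-01-03 01:52:08"]))

# Gap to the previous log; the first row has no predecessor, so it gets a zero gap.
dt = ts.diff().fillna(pd.Timedelta(0))

# A gap longer than 30 minutes marks the first log of a new session.
first_logs = dt > pd.Timedelta(minutes=30)

# Running count of session starts, shifted so numbering begins at 1.
sessions = first_logs.cumsum() + 1

# Mean number of logs per session, scaled by 10 minutes per log.
logs_per_session = float(sessions.value_counts().mean())
duration = pd.Timedelta(minutes=10) * logs_per_session

print(sessions.tolist(), duration)
```

The two logs are more than 30 minutes apart, so they fall into two sessions of one log each, which matches the s = 2 and a = 10 minutes shown for char 9.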

After unstacking, it looks like this:

timetable.unstack()

    a           s
char        
7   00:20:00    00:00:00.000000
9   00:10:00    00:00:00.000000
10  00:20:00    00:00:00.000000

1 answer:

Answer 0 (score: 3)

This looks like a bug.

I think you can return a DataFrame from the function, in which case unstack is not needed:

def get_session(ts):
    ts = ts.sort_index()
    dt = (ts - ts.shift()).fillna(0)
    first_logs = dt > '30m'
    sessions = first_logs.cumsum() + 1
    duration = sessions.value_counts().mean() * np.timedelta64(10, 'm')
    return pd.DataFrame({'s': max(sessions), 'a': duration}, index=[0])

timetable = data.groupby('char')['timestamp'].apply(get_session)
print(timetable)
              a  s
char              
7    0 00:20:00  1
9    0 00:10:00  2
10   0 00:20:00  1

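However, there is a problem with the index (the second level is all 0), so you can build the index from column s instead and then set the index name with rename_axis (new in pandas 0.18.0):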

def get_session(ts):
    ts = ts.sort_index()
    dt = (ts - ts.shift()).fillna(0)
    first_logs = dt > '30m'
    sessions = first_logs.cumsum() + 1
    duration = sessions.value_counts().mean() * np.timedelta64(10, 'm')
    return pd.DataFrame({'a': duration}, index=[max(sessions)]).rename_axis('s')

timetable = data.groupby('char')['timestamp'].apply(get_session)
print(timetable)
              a
char s         
7    1 00:20:00
9    2 00:10:00
10   1 00:20:00
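
If the mixed-type Series already exists and the function cannot be changed, another possible way around unstack is to pull out each inner label separately and give it one concrete dtype before assembling the columns. This is only a sketch: the Series is rebuilt by hand from the values in the question, and the names `a`, `s` and `wide` are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for the mixed-type result from the question.
index = pd.MultiIndex.from_product([[7, 9, 10], ["a", "s"]], names=["char", None])
timetable = pd.Series(
    [pd.Timedelta(minutes=20), 1,
     pd.Timedelta(minutes=10), 2,
     pd.Timedelta(minutes=20), 1],
    index=index,
    dtype=object,
)

# Select each inner label separately, then convert it to a single dtype.
a = pd.to_timedelta(timetable.xs("a", level=1))
s = timetable.xs("s", level=1).astype("int64")

# Both pieces are indexed by "char", so they align into one wide table.
wide = pd.DataFrame({"a": a, "s": s})
print(wide.dtypes)
```

Because each column is converted on its own, the timedeltas and the integers never have to share a dtype during the reshape.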