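Pandas apply on groups of a DataFrame

Time: 2013-12-19 20:34:52

Tags: python pandas

If I group the DataFrame (the g object below) and then apply the following function to the first 1000 rows of df, it works. But if I apply it to the whole df, I get this exception: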

    def calc_load(x):
        # Note: DataFrame.sort returns a sorted copy by default, so the
        # result of this call is discarded as written.
        x.sort('log_timestamp')
        x['time_stddev'] = x['time'].std()
        x['time_mean'] = x['time'].mean()
        return x


    c=g.apply(calc_load)
    ---------------------------------------------------------------------------
    ........

    ValueError                                Traceback (most recent call last)
    <ipython-input-262-f2fe1f013907> in <module>()
    ----> 1 c=g.apply(calc_load)
       2215             tuple(map(int, [tot_items] + list(block_shape))),
    -> 2216             tuple(map(int, [len(ax) for ax in axes]))))
       2217
       2218

    ValueError: Shape of passed values is (10, 3943482), indices imply (10, 410450)

What is the cause here, and how can I fix it?

Update

I am reading this table from an HDF5 store:

prob2
Out[374]:
<class 'pandas.io.pytables.HDFStore'>
File path: /tmp/test2.h5
/mytable            frame_table  (typ->appendable,nrows->410450,ncols->8,indexers->[index])

a=prob2.mytable

a
Out[376]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 9999
Data columns (total 8 columns):
args             410450  non-null values
host             410450  non-null values
kwargs           410450  non-null values
log_timestamp    410450  non-null values
operation        410450  non-null values
slot             410450  non-null values
status           410450  non-null values
time             410450  non-null values
dtypes: float64(1), int64(2), object(5)

If I round-trip the data through CSV as below, the exception does not occur:

a.to_csv('/tmp/test2.csv')

b=pd.read_csv('/tmp/test2.csv')

b
Out[379]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 9 columns):
Unnamed: 0       410450  non-null values
args             410450  non-null values
host             410450  non-null values
kwargs           410450  non-null values
log_timestamp    410450  non-null values
operation        410450  non-null values
slot             410450  non-null values
status           410450  non-null values
time             410450  non-null values
dtypes: float64(1), int64(3), object(5)

bg = b.groupby(['host','operation'])

bg.apply(calc_load)
Out[381]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 11 columns):
Unnamed: 0       410450  non-null values
args             410450  non-null values
host             410450  non-null values
kwargs           410450  non-null values
log_timestamp    410450  non-null values
operation        410450  non-null values
slot             410450  non-null values
status           410450  non-null values
time             410450  non-null values
time_stddev      410371  non-null values
time_mean        410450  non-null values
dtypes: float64(3), int64(3), object(5)

The DataFrames before the round trip (a) and after it (b) look similar, but they are not the same!

a
Out[386]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 9999
Data columns (total 8 columns):
args             410450  non-null values
host             410450  non-null values
kwargs           410450  non-null values
log_timestamp    410450  non-null values
operation        410450  non-null values
slot             410450  non-null values
status           410450  non-null values
time             410450  non-null values
dtypes: float64(1), int64(2), object(5)



b
Out[387]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 410450 entries, 0 to 410449
Data columns (total 9 columns):
Unnamed: 0       410450  non-null values
args             410450  non-null values
host             410450  non-null values
kwargs           410450  non-null values
log_timestamp    410450  non-null values
operation        410450  non-null values
slot             410450  non-null values
status           410450  non-null values
time             410450  non-null values
dtypes: float64(1), int64(3), object(5)

Hm, what is going on here?

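1 answer:

Answer 0 (score: 4)

After grouping by host/operation you have many duplicates in the index. That is why the probe on the first 1000 rows works, but the whole set does not.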
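
The summary of `a` in the question (410450 entries, with labels running only from 0 to 9999) suggests that index labels repeat. Below is a minimal sketch of how one might confirm that, using a hypothetical toy frame; `chunk`, `frame`, and `round_tripped` are illustrative names that do not appear in the question. It also shows why a CSV round trip can hide the problem: `read_csv` without `index_col` builds a fresh, unique index and keeps the old labels as the `Unnamed: 0` column.

```python
import io

import pandas as pd

# Hypothetical toy data: two chunks of three rows each, concatenated so that
# the index labels restart at 0 (an assumption about how the table was built).
chunk = pd.DataFrame({"time": [0.1, 0.2, 0.3]})
frame = pd.concat([chunk, chunk])

print(frame.index.is_unique)           # False: labels 0, 1, 2 each occur twice
print(frame.index.duplicated().sum())  # 3 rows repeat an earlier label

# Round trip through CSV in memory; the old index is written as an unnamed
# first column and read back as a regular column.
buf = io.StringIO()
frame.to_csv(buf)
buf.seek(0)
round_tripped = pd.read_csv(buf)

print(round_tripped.columns.tolist())  # ['Unnamed: 0', 'time']
print(round_tripped.index.is_unique)   # True: read_csv assigned a new index
```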

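Reset the index first, then group and apply. You can restore the original index by setting it again at the end. Resetting the index turns it into a column named 'index' (which set_index then removes from the columns when it becomes the index again).

This is actually a fairly common pattern. I think a more helpful error message would be possible, see here. I am not sure groupby should fix this automatically (it could), because this might be either a user error or intentional.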

In [26]: df = d.reset_index().groupby(['host','operation']).apply(calc_load).set_index('index')

In [27]: df
Out[27]: 
      args            host kwargs        log_timestamp             operation      slot  status      time  time_stddev  time_mean
index                                                                                                                           
0       []   yy3.segm1.org     {}  1385984306000000000      x_gWidgboxParams    a12yy3    -101  0.000477     0.061657   0.003226
1       []  yy14.segm1.org     {}  1385984306000000000         x_initWidgbox   a11yy14       1  0.004177     0.035759   0.005816
10      []  yy32.segm1.org     {}  1385984307000000000             gSettings   a13yy32    -101  0.009686     0.245170   0.070137
100     []  yy19.segm1.org     {}  1385984308000000000  notifyTestsDelivered   a16yy19       1  0.000766     0.002825   0.000964
1000    []   yy7.segm1.org     {}  1385984320000000000           addWidging2    a12yy7       1  0.002576     0.008525   0.004122
10000   []  yy14.segm1.org     {}  1385984461000000000           addWidging2   a13yy14       1  0.001317     0.009431   0.003910
10001   []  yy14.segm1.org     {}  1385984461000000000               gxyzinf   a13yy14    -101  0.000542     0.001861   0.001074
10002   []  yy20.segm1.org     {}  1385984461000000000               x_gbinf  I502yy20    -101  0.000522     0.001043   0.000743
10003   []  yy20.segm1.org     {}  1385984461000000000       setFlagsOneWidg  I502yy20       1  0.001660     0.005404   0.002910
10004   []  yy14.segm1.org     {}  1385984461000000000  notifyTestsDelivered   a13yy14       1  0.000551     0.002877   0.001156
10005   []  yy20.segm1.org     {}  1385984461000000000               gxyzinf  I502yy20    -101  0.000521     0.000802   0.000813
10006   []  yy14.segm1.org     {}  1385984461000000000           addWidging2   a13yy14       1  0.001256     0.009431   0.003910
10007   []  yy14.segm1.org     {}  1385984461000000000               gxyzinf   a13yy14    -101  0.000414     0.001861   0.001074
10008   []  yy14.segm1.org     {}  1385984461000000000           addWidging2   a13yy14       1  0.001222     0.009431   0.003910
10009   []  yy14.segm1.org     {}  1385984461000000000               gxyzinf   a13yy14    -101  0.000475     0.001861   0.001074
1001    []   yy7.segm1.org     {}  1385984320000000000               gxyzinf    a12yy7    -101  0.000783     0.003059   0.001004
10010   []  yy14.segm1.org     {}  1385984461000000000         x_initWidgbox   a12yy14       1  0.002764     0.035759   0.005816
10011   []  yy32.segm1.org     {}  1385984461000000000         x_initWidgbox   a15yy32       1  0.057966     0.334923   0.147668
10012   []   yy3.segm1.org     {}  1385984461000000000             gSettings    a11yy3    -101  0.006519     0.163707   0.017649
10013   []  yy30.segm1.org     {}  1385984461000000000                gtfull   a13yy30    -101  0.003648     0.116366   0.014088
10014   []   yy6.segm1.org     {}  1385984461000000000               x_gbinf    a16yy6    -101  0.000621     0.005796   0.001139
10015   []  yy34.segm1.org     {}  1385984461000000000                gtfull   a14yy34    -101  0.002031     0.015581   0.007747
10016   []  yy34.segm1.org     {}  1385984461000000000               x_gbinf   a14yy34    -101  0.000546     0.002596   0.001899
10017   []  yy34.segm1.org     {}  1385984461000000000       setFlagsOneWidg   a14yy34       1  0.001358     0.003515   0.005866
10018   []  yy34.segm1.org     {}  1385984461000000000               gxyzinf   a14yy34    -101  0.000486     0.004446   0.002018
10019   []  yy25.segm1.org     {}  1385984461000000000                gtfull   a13yy25    -101  0.002029     0.001793   0.002355
1002    []   yy7.segm1.org     {}  1385984320000000000  notifyTestsDelivered    a12yy7       1  0.000847     0.003748   0.001081
10020   []  yy32.segm1.org     {}  1385984462000000000             gFolderId   a15yy32    -101  0.018326     0.187434   0.058200
10021   []  yy25.segm1.org     {}  1385984462000000000               x_gbinf   a13yy25    -101  0.000589     0.001716   0.000830
10022   []  yy25.segm1.org     {}  1385984462000000000            updateWidg   a13yy25       1  0.003058     0.004660   0.003973
10023   []  yy25.segm1.org     {}  1385984462000000000            clearElems   a13yy25       1  0.000661     0.004893   0.001687
10024   []  yy10.segm1.org     {}  1385984462000000000                gtfull   a18yy10    -101  0.002779     0.069679   0.007495
10025   []  yy13.segm1.org     {}  1385984462000000000                gtfull   a11yy13    -101  0.001978     0.124069   0.012524
10026   []  yy32.segm1.org     {}  1385984462000000000               x_gbinf   a14yy32    -101  0.018674     0.190657   0.058083
10027   []  yy10.segm1.org     {}  1385984462000000000               x_gbinf   a18yy10    -101  0.000874     0.007170   0.001606
10028   []  yy32.segm1.org     {}  1385984462000000000               gWidgId   a14yy32       1  0.014523     1.518315   0.559983
10029   []  yy13.segm1.org     {}  1385984462000000000               x_gbinf   a11yy13    -101  0.000577     0.008605   0.001130
1003    []   yy7.segm1.org     {}  1385984320000000000      x_gWidgboxParams    a12yy7    -101  0.000933     0.001084   0.001442
10030   []  yy13.segm1.org     {}  1385984462000000000       setFlagsOneWidg   a11yy13       1  0.001611     0.011409   0.004093
10031   []  yy13.segm1.org     {}  1385984462000000000               gxyzinf   a11yy13    -101  0.000575     0.053991   0.003044
10032   []  yy39.segm1.org     {}  1385984462000000000                gtfull   a13yy39    -101  0.002005     0.034577   0.003504
10033   []  yy39.segm1.org     {}  1385984462000000000               x_gbinf   a13yy39    -101  0.000539     0.001371   0.000931
10034   []  yy32.segm1.org     {}  1385984462000000000           addWidging2   a15yy32       1  0.122369     1.414068   0.441565
10035   []  yy32.segm1.org     {}  1385984462000000000           moveOneWidg   a12yy32       1  0.468481     1.303089   0.665778
10036   []  yy32.segm1.org     {}  1385984462000000000               gxyzinf   a15yy32    -101  0.018006     0.155379   0.040389
10037   []  yy32.segm1.org     {}  1385984462000000000  notifyTestsDelivered   a15yy32       1  0.006874     0.129650   0.032741
10038   []  yy32.segm1.org     {}  1385984462000000000               gxyzinf   a12yy32    -101  0.016607     0.155379   0.040389
10039   []  yy39.segm1.org     {}  1385984462000000000            updateWidg   a13yy39       1  0.003879     0.005466   0.006465
1004    []  yy34.segm1.org     {}  1385984320000000000                gtfull   a11yy34    -101  0.003681     0.015581   0.007747
10040   []  yy39.segm1.org     {}  1385984462000000000                SELECT   a13yy39  217831  0.000423     0.000126   0.000551
10041   []  yy39.segm1.org     {}  1385984462000000000            clearElems   a13yy39       1  0.000705     0.002367   0.001356
10042   []   yy3.segm1.org     {}  1385984462000000000           moveOneWidg    a15yy3       1  0.002660     0.027428   0.009078
10043   []   yy3.segm1.org     {}  1385984462000000000               gxyzinf    a15yy3    -101  0.000436     0.041627   0.001913
10044   []  yy39.segm1.org     {}  1385984462000000000             gSettings   a11yy39    -101  0.002237     0.007467   0.002679
10045   []  yy32.segm1.org     {}  1385984462000000000             gSettings   a15yy32    -101  0.012113     0.245170   0.070137
10046   []  yy32.segm1.org     {}  1385984462000000000      x_gWidgboxParams   a15yy32    -101  0.030427     0.143941   0.050055
10047   []  yy13.segm1.org     {}  1385984462000000000           moveOneWidg   a12yy13       1  0.003796     0.117085   0.017910
10048   []  yy13.segm1.org     {}  1385984462000000000               gxyzinf   a12yy13    -101  0.000521     0.053991   0.003044
10049   []  yy30.segm1.org     {}  1385984462000000000      x_gWidgboxParams   a13yy30    -101  0.002451     0.051829   0.003644
1005    []  yy12.segm1.org     {}  1385984320000000000                gtfull   a15yy12    -101  0.003428     0.005479   0.003063
       ...             ...    ...                  ...                   ...       ...     ...       ...          ...        ...

[410450 rows x 10 columns]
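
The reset-then-restore pattern from this answer can be tried on a small frame. The sketch below is an illustration under assumptions, not the answer's exact code: the data is invented, and it swaps `apply(calc_load)` for `groupby(...).transform(...)`, which computes the same per-group mean and standard deviation and behaves consistently across pandas versions. The names `flat`, `grouped_time`, and `result` are hypothetical.

```python
import pandas as pd

# Invented toy data with a deliberately non-unique index (labels repeat).
df = pd.DataFrame(
    {
        "host": ["a", "a", "b", "b", "a", "b"],
        "operation": ["x", "x", "x", "x", "x", "x"],
        "time": [1.0, 3.0, 10.0, 20.0, 5.0, 30.0],
    },
    index=[0, 1, 2, 0, 1, 2],
)

# Step 1: move the duplicated labels into a column named "index",
# leaving a fresh unique index to work with.
flat = df.reset_index()

# Step 2: compute per-group statistics. transform returns one value per
# input row, so the results line up with flat row for row.
grouped_time = flat.groupby(["host", "operation"])["time"]
flat["time_stddev"] = grouped_time.transform("std")
flat["time_mean"] = grouped_time.transform("mean")

# Step 3: restore the original (non-unique) labels as the index.
result = flat.set_index("index")
print(result)
```

Because `transform` hands back values aligned with the input rows, there is no need to reassemble per-group frames, which is the step where the shape mismatch in the question was raised.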