"Group by" multiple columns on large data in HDFStore

Time: 2014-09-26 21:33:04

Tags: python pandas hdf5

Pandas "Group By" Query on Large Data in HDFStore?

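I tried the example from the answer to that question, but I would like to be able to group by two columns.

Basically, I modified the code to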

with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])
    print("store:\n%s" % store)

    print("\ndf:\n%s" % store['df'])

    # get the groups (this call is the one that raises the KeyError)
    groups = store.select_column('df', ['A', 'B']).unique()
    print("\ngroups:%s" % groups)

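I tried several ways of selecting columns A and B, but could not get it to work.

It raises the error KeyError: "column [['A', 'B']] not found in the table"

Is this supported?

Thanks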

1 answer:

Answer 0: (score: 2)

store.select_column(...) only selects a SINGLE column.

Slightly modifying the linked original code:

import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': [1,1,1,2,
                         1,1,1,2,
                         2,2,1],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})


# create the store and append, using data_columns for the columns
# I might want to group or filter on
with pd.HDFStore(fname, mode='w') as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])

    print("\ndf:\n%s" % store['df'])

    # get the groups: select_column reads one column at a time,
    # so read A and B separately and combine them
    A = store.select_column('df', 'A')
    B = store.select_column('df', 'B')
    idx = pd.MultiIndex.from_arrays([A, B])
    groups = idx.unique()

    # iterate over the groups and apply my operations
    results = []
    for (a, b) in groups:

        # only the rows of this group are read from disk;
        # the string value is quoted so it is parsed as a literal
        grp = store.select('df', where=['A="%s" and B=%s' % (a, b)])

        # this is a regular frame, aggregate however you would like
        results.append(grp[['D', 'E', 'F']].sum())

print("\nresult:\n%s" % pd.concat(results, keys=groups))

os.remove(fname)

Here are the results.

The starting frame (different from the original example in that column B is now an integer, just for clarity)

df:
      A  B      C         D         E         F
0   foo  1   dull  0.993672 -0.889936  0.300826
1   foo  1   dull -0.708760 -1.121964 -1.339494
2   foo  1  shiny -0.606585 -0.345783  0.734747
3   foo  2   dull -0.818121 -0.187682 -0.258820
4   bar  1   dull -0.612097 -0.588711  1.417523
5   bar  1  shiny -0.591513  0.661931  0.337610
6   bar  1  shiny -0.974495  0.347694 -1.100550
7   bar  2   dull  1.888711  1.824509 -0.635721
8   foo  2  shiny  0.715446 -0.540150  0.789633
9   foo  2  shiny -0.262954  0.957464 -0.042694
10  foo  1  shiny  0.193822 -0.241079 -0.478291

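The unique groups. We select each column that needs to be grouped on independently, then take the resulting values and construct a MultiIndex. These are the unique groups of the resulting MultiIndex.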

groups:[('foo', 1) ('foo', 2) ('bar', 1) ('bar', 2)]
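
The group-key step can be tried without an HDF5 file. Below is a minimal sketch, assuming two in-memory Series stand in for what store.select_column returns; the names col_a, col_b and group_keys are illustrative and not part of the answer.

```python
import pandas as pd

# Stand-ins for store.select_column('df', 'A') and store.select_column('df', 'B')
col_a = pd.Series(['foo', 'foo', 'foo', 'foo',
                   'bar', 'bar', 'bar', 'bar',
                   'foo', 'foo', 'foo'])
col_b = pd.Series([1, 1, 1, 2,
                   1, 1, 1, 2,
                   2, 2, 1])

# Pair the two columns row by row, then drop the repeated pairs
idx = pd.MultiIndex.from_arrays([col_a, col_b], names=['A', 'B'])
groups = idx.unique()

# Iterating a MultiIndex yields one tuple per distinct (A, B) pair
group_keys = [tuple(key) for key in groups]
print(group_keys)
```

Only the key columns are held in memory at this point, which is why this step stays cheap even when the full table is large.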

The final result.

result:
foo  1  D   -0.127852
        E   -2.598762
        F   -0.782213
     2  D   -0.365629
        E    0.229632
        F    0.488119
bar  1  D   -2.178105
        E    0.420914
        F    0.654583
     2  D    1.888711
        E    1.824509
        F   -0.635721
dtype: float64
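
For a frame small enough to fit in memory, the per-group loop can be cross-checked against a plain in-memory groupby. This is a sketch under that assumption: it uses fixed, made-up values for D and E (not the random data above), boolean masks in place of store.select, and collects the sums into a wide frame rather than the concatenated Series shown in the answer.

```python
import pandas as pd

# Small made-up frame; the values are chosen so the sums are exact
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'foo'],
                   'B': [1, 2, 1, 1, 2],
                   'D': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'E': [0.5, 0.5, 0.5, 0.5, 0.5]})

groups = pd.MultiIndex.from_arrays([df['A'], df['B']],
                                   names=['A', 'B']).unique()

pieces = []
for a, b in groups:
    # stands in for store.select('df', where=...) on the HDF5 table
    grp = df[(df['A'] == a) & (df['B'] == b)]
    pieces.append(grp[['D', 'E']].sum())

# one row per (A, B) group, one column per aggregated field
looped = pd.DataFrame(pieces, index=groups).sort_index()

# the same aggregation done in one step in memory
direct = df.groupby(['A', 'B'])[['D', 'E']].sum()

print(looped)
print(direct)
```

If the two printed frames agree on a small sample, the same loop can be pointed at the store for data that does not fit in memory.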