How to speed up Pandas groupby apply

Time: 2014-07-03 11:06:36

Tags: python pandas

I have a data frame in the following format that I want to reshape.

projectid    vendor_name    project_resource_type    item_quantity
12345        amazon            tech                       5
12345        best buy          supplies                   2
abcde        amazon            tech                       1
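
For anyone who wants to reproduce this, the sample rows above could be built roughly as follows. This is only a sketch: storing projectid as a string is an assumption, made because 'abcde' is not numeric.

```python
import pandas as pd

# The three sample rows from the table above.
# Assumption: projectid is stored as a string, since 'abcde' is not numeric.
resources = pd.DataFrame({
    "projectid": ["12345", "12345", "abcde"],
    "vendor_name": ["amazon", "best buy", "amazon"],
    "project_resource_type": ["tech", "supplies", "tech"],
    "item_quantity": [5, 2, 1],
})

print(resources)
```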

In short, I want to transform the data frame into something like this:

projectid    amazon best_buy tech supplies total_quantity
12345           1.      1.    1.    1.         7.
abcde           1.      0.    1.    0.         1.

So I did the following:

import numpy as np
import pandas as pd

# resources is the data frame shown above
has_vendors = pd.get_dummies(resources.vendor_name, prefix='has_vendor')
resources.drop('vendor_name', axis=1, inplace=True)
resources = pd.merge(resources, has_vendors, left_index=True, right_index=True, how='outer')

print('merging resource type dummies')
resources_types = pd.get_dummies(resources.project_resource_type, prefix='has_resource_type')
resources.drop('project_resource_type', axis=1, inplace=True)
resources = pd.merge(resources, resources_types, left_index=True, right_index=True, how='outer')

gb = resources.groupby('projectid')

# Indicator columns created by get_dummies above
columns = [x for x in resources.columns.values if 'has_vendor' in x or 'has_resource_type' in x]
# Output columns: the indicators plus the summed quantity
all_cols = columns + ['total_quantity']

def group(x):
    # One 0./1. flag per indicator column: 1. if any row in the group has it set
    vals = [np.any(x[col]) + 0. for col in columns]
    # Total quantity across the group's rows
    vals.append(np.sum(x['item_quantity']))
    return pd.Series(vals, index=all_cols)

resources_agg = gb.apply(group)

The problem is that gb.apply(group) is too slow: there are about 650,000 unique project IDs. Is there another way to speed this up?

1 answer:

Answer 0: (score: 2)

You can try pivot_table and see whether it is faster:

>>> aggrfn = lambda ts: 1 if 0 < ts.sum() else 0
>>> df.pivot_table(values='item_quantity', index='projectid', columns='vendor_name', aggfunc=aggrfn, fill_value=0)
vendor_name  amazon  best buy
projectid                    
12345             1         1
abcde             1         0

>>> df.pivot_table(values='item_quantity', index='projectid', columns='project_resource_type', aggfunc=aggrfn, fill_value=0)
project_resource_type  supplies  tech
projectid                            
12345                         1     1
abcde                         0     1

>>> df.groupby('projectid')['item_quantity'].agg(total_quantity='sum')
           total_quantity
projectid                
12345                   7
abcde                   1

If objs is a list containing the results above, you can join them:

>>> pd.concat(objs, axis=1)
           amazon  best buy  supplies  tech  total_quantity
projectid                                                  
12345           1         1         1     1               7
abcde           1         0         0     1               1
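
For reference, here is a self-contained sketch that builds such a list and joins it in one go. It swaps the lambda-based pivot_table for pd.crosstab clipped at 1, which is a different technique from the one above: it flags a category as present if any row exists for it (like the has_vendor dummies in the question), rather than testing whether the summed item_quantity is positive. The helper name presence is made up for this illustration, and whether this is actually faster on 650,000 project IDs would need to be timed on the real data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "projectid": ["12345", "12345", "abcde"],
    "vendor_name": ["amazon", "best buy", "amazon"],
    "project_resource_type": ["tech", "supplies", "tech"],
    "item_quantity": [5, 2, 1],
})


def presence(frame, column):
    """Hypothetical helper: 0/1 table of projectid by the values of `column`."""
    # crosstab counts rows per (projectid, category);
    # clipping at 1 turns those counts into presence flags.
    table = pd.crosstab(frame["projectid"], frame[column]).clip(upper=1)
    table.columns.name = None  # drop the axis label so the joined header stays flat
    return table


objs = [
    presence(df, "vendor_name"),
    presence(df, "project_resource_type"),
    # Built-in "sum" avoids calling a Python function once per group.
    df.groupby("projectid")["item_quantity"].sum().rename("total_quantity"),
]

result = pd.concat(objs, axis=1)
print(result)
```

All three pieces are indexed by projectid, so pd.concat with axis=1 lines them up row by row without an explicit merge.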