有一个数据框我想操作以下格式。
projectid vendor_name project_resource_type item_quantity
12345 amazon tech 5
12345 best buy supplies 2
abcde amazon tech 1
总之,我想把数据框操作到像这样的东西
projectid amazon best_buy tech supplies total_quantity
12345 1. 1. 1. 1. 7
abcde 1. 0. 0. 1. 1.
所以我做了以下
has_vendors = pd.get_dummies(resources.vendor_name, prefix='has_vendor')
resources.drop('vendor_name', 1, inplace=True)
resources = pd.merge(resources, has_vendors, left_index=True, right_index=True, how='outer')
print 'merging resource type dummies'
resources_types = pd.get_dummies(resources.project_resource_type, prefix='has_resource_type')
resources.drop('project_resource_type', 1, inplace=True)
resources = pd.merge(resources, resources_types, left_index=True, right_index=True, how='outer')
gb = resources.groupby('projectid')
columns = [x for x in resources.columns.values if 'has_vendor' in x or 'has_resource_type' in x]
all_cols = [x for x in resources.columns.values if 'has_vendor' in x or 'has_resource_type' in x]
all_cols.append('total_quantity')
def group(x):
vals = []
for i,col in enumerate(columns):
v = np.any(x[col]) + 0.
vals.append(v)
su = np.sum(x['item_quantity'])
vals.append(su)
return pd.Series(vals, index=all_cols)
resources_agg = gb.apply(group)
事情是,我发现gb.apply(group)
函数太慢,大约有650000个唯一项目ID。还有其他方法来加速这件事吗?
答案 0 :(得分:2)
如果pivot_table
更快,你可以试试:
>>> aggrfn = lambda ts: 1 if 0 < ts.sum() else 0
>>> df.pivot_table('item_quantity', 'projectid', 'vendor_name', aggrfn, 0)
vendor_name amazon best buy
projectid
12345 1 1
abcde 1 0
>>> df.pivot_table('item_quantity', 'projectid', 'project_resource_type', aggrfn, 0)
project_resource_type supplies tech
projectid
12345 1 1
abcde 0 1
>>> df.groupby('projectid')['item_quantity'].aggregate({'total_quantity':'sum'})
total_quantity
projectid
12345 7
abcde 1
如果objs
是包含上述结果的列表,您可以将它们加入:
>>> pd.concat(objs, axis=1)
amazon best buy supplies tech total_quantity
projectid
12345 1 1 1 1 7
abcde 1 0 0 1 1