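Python and Pandas newbie here! I'm trying to transpose a DataFrame containing a million records using a for loop. As you can imagine, it is painfully slow. Please see my process and code below.
I'm working with two DataFrames: transactions - contains the customer id (column a) and the category each customer purchased from.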
import pandas

transactions = pandas.DataFrame({'a': ['johnny', 'sally', 'maggy', 'lassy', 'johnny', 'sally', 'maggy'],
                                 'category': ['fruits', 'fruits', 'spices', 'veggies', 'veggies', 'spices', 'snacks']})
category_list - contains all the categories a customer can purchase from.
category_list=pandas.DataFrame({'category':['fruits','spices','veggies','snacks','drinks','alcohol','adult']})
For each customer, assign a value of 1 if the customer has (ever) made a purchase in the given category. If not, assign a value of 0.
Code:
cust_list = transactions['a'].unique()
final_data = pandas.DataFrame()
for i in cust_list:
    # Distinct purchases of this customer, each flagged with 1.
    step1 = transactions[transactions.a == i]
    step1 = step1.drop_duplicates().copy()
    step1['value'] = 1
    # Right merge so every category appears, purchased or not.
    cat_merge = pandas.merge(step1, category_list, how='right', left_on='category', right_on='category')
    cat_merge['a'] = i
    cat_merge = cat_merge.fillna(0)
    # Transpose, drop the 'a' row, then promote the 'category' row to the header.
    cat_merge_transpose = pandas.DataFrame(cat_merge.transpose())
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.columns = cat_merge_transpose.iloc[0]
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.reset_index()  # no effect: the result is not assigned
    cat_merge_transpose.insert(0, 'a', i)
    # DataFrame.append was removed in pandas 2.0; pandas.concat does the same job.
    final_data = pandas.concat([final_data, cat_merge_transpose], ignore_index=True)
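So in this case, the result would look like this:
print(final_data)
Any help optimizing this so that it runs significantly faster would be appreciated, and a solution with less code would be very welcome.
Thanks.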
Answer 0: (score: 2)
Your problem can be treated as a pivot operation, so we can use pivot_table:
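>>> df = transactions.copy()  # df is the transactions frame from the question
>>> df["value"] = 1
>>> P = df.pivot_table(index="a", columns="category", values="value", aggfunc="max")
>>> # reindex instead of .loc: recent pandas raises KeyError for missing column labels
>>> P.reindex(columns=category_list.category.unique()).fillna(0)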
category  fruits  spices  veggies  snacks  drinks  alcohol  adult
a
johnny         1       0        1       0       0        0      0
lassy          0       0        1       0       0        0      0
maggy          0       1        0       1       0        0      0
sally          1       1        0       0       0        0      0
pivot_table by itself gives us
>>> P
category  fruits  snacks  spices  veggies
a
johnny         1     NaN     NaN        1
lassy        NaN     NaN     NaN        1
maggy        NaN       1       1      NaN
sally          1     NaN       1      NaN
We then select every category column (including the ones that never occur in the transactions) and call fillna to replace the NaN values with 0.
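For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same pivot_table idea, using the sample data from the question. The function name purchase_flags is just something I made up for illustration, and the final astype(int) is my own addition so the flags come out as integers rather than floats.

```python
import pandas as pd

transactions = pd.DataFrame({
    "a": ["johnny", "sally", "maggy", "lassy", "johnny", "sally", "maggy"],
    "category": ["fruits", "fruits", "spices", "veggies", "veggies", "spices", "snacks"],
})
category_list = pd.DataFrame({
    "category": ["fruits", "spices", "veggies", "snacks", "drinks", "alcohol", "adult"],
})


def purchase_flags(transactions, category_list):
    """Hypothetical helper: one row per customer, one 0/1 column per category."""
    # Flag every distinct (customer, category) pair with 1.
    df = transactions.drop_duplicates().assign(value=1)
    pivoted = df.pivot_table(index="a", columns="category",
                             values="value", aggfunc="max")
    # reindex adds the categories nobody bought as all-NaN columns,
    # fillna turns the NaN into 0, astype(int) is an optional tidy-up.
    all_categories = category_list["category"].unique()
    return pivoted.reindex(columns=all_categories).fillna(0).astype(int)


flags = purchase_flags(transactions, category_list)
print(flags)
```

No Python-level loop over customers is involved, which is where the speedup over the original code should come from.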
Answer 1: (score: 1)
# Get a unique list of all category items.
categories = category_list.category.unique().tolist()
# For transactions with a given customer matching any category, assign a value of one.
transactions['value'] = transactions.groupby('a').category.transform(
    lambda s: s.isin(categories).any()).astype(int)
output = transactions.groupby(['a', 'category']).max().unstack().fillna(0)
output.columns = output.columns.droplevel()
zero_cols = [c for c in categories if c not in output]
for col in zero_cols:
    output[col] = 0
>>> output
category  fruits  snacks  spices  veggies  drinks  alcohol  adult
a
johnny         1       0       0        1       0        0      0
lassy          0       0       0        1       0        0      0
maggy          0       1       1        0       0        0      0
sally          1       0       1        0       0        0      0
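As a possible alternative that neither answer above uses, pandas.crosstab can count purchases per customer and category in one call. The sketch below assumes the same sample data as the question; capping the counts at 1 and reindexing over the full category list are my own choices to match the requested 0/1 layout.

```python
import pandas as pd

transactions = pd.DataFrame({
    "a": ["johnny", "sally", "maggy", "lassy", "johnny", "sally", "maggy"],
    "category": ["fruits", "fruits", "spices", "veggies", "veggies", "spices", "snacks"],
})
category_list = pd.DataFrame({
    "category": ["fruits", "spices", "veggies", "snacks", "drinks", "alcohol", "adult"],
})

# Count how many times each customer bought from each category.
counts = pd.crosstab(transactions["a"], transactions["category"])

# Cap the counts at 1 so repeat purchases still read as a plain flag.
flags = counts.clip(upper=1)

# Add the categories nobody purchased from, filled with 0,
# and put the columns in the order given by category_list.
flags = flags.reindex(columns=category_list["category"].tolist(), fill_value=0)

print(flags)
```

Because crosstab returns integer counts and reindex is given fill_value=0, the result stays integer throughout, so no fillna step is needed here.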