My data frame looks like this:
My goal is:
Explanation:
The code I'm using:
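import time
import pandas as pd

# base_df is the data frame described above
start_time = time.time()
df = pd.DataFrame()
for CustomerName in base_df.CustomerName.unique():
    df1 = base_df[base_df['CustomerName'] == CustomerName][['CustomerName', 'order_seq', 'Category']]
    # Cartesian product of the categories across this customer's order_seq groups
    df2 = pd.DataFrame(index=pd.MultiIndex.from_product(
        [subdf['Category'] for p, subdf in df1.groupby('order_seq')],
        names=df1.order_seq.unique())).reset_index()
    df2['CustomerName'] = CustomerName
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df = pd.concat([df, df2])
print("--- %s seconds ---" % (time.time() - start_time))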
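This takes about 10 minutes to run on my dataset; I'm looking for a faster approach.
I'm working in pandas right now, but pointers to R or SQL are welcome too! Thanks!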
Answer 0 (score: 1)
Consider merging three data frames, one per OrderSequence value, each joined on the distinct CustomerName values:
import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})

# Start from the distinct customers, then inner-merge one frame per order sequence
finaldf = pd.DataFrame(df['CustomerName'].drop_duplicates())
for i in range(1, 4):
    seqdf = df[df['OrderSequence'] == i][['CustomerName', 'Category']].\
        rename(columns={'Category': 'Category' + str(i)})
    finaldf = pd.merge(finaldf, seqdf, on=['CustomerName'])

print(finaldf)
#     CustomerName  Category1  Category2  Category3
#  0             1       Food       Food    Clothes
#  1             1       Food       Food       Food
#  2             1       Food       Food       Toys
#  3             1       Food    Clothes    Clothes
#  4             1       Food    Clothes       Food
#  5             1       Food    Clothes       Toys
#  6             1       Food  Furniture    Clothes
#  7             1       Food  Furniture       Food
#  8             1       Food  Furniture       Toys
#  9             2    Clothes       Toys       Food
# 10             3  Furniture       Food       Food
# 11             3       Toys       Food       Food
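The loop in this answer hard-codes `range(1, 4)`, so it assumes exactly three order sequences. As a minimal sketch, assuming the same column names, the merges can be chained over whatever sequence values are present. The helper name `order_combinations` and its parameters are hypothetical and only serve the illustration.

```python
from functools import reduce

import pandas as pd


def order_combinations(df, key="CustomerName", seq="OrderSequence", value="Category"):
    """Hypothetical helper: one row per combination of `value` across the `seq` levels of each `key`."""
    # Start from the distinct keys, as the answer does with drop_duplicates().
    base = df[[key]].drop_duplicates()
    # One narrow frame per sequence value, with the value column renamed Category<i>.
    parts = [
        df.loc[df[seq] == i, [key, value]].rename(columns={value: f"{value}{i}"})
        for i in sorted(df[seq].unique())
    ]
    # Chain the inner merges; each merge multiplies the rows within a key.
    return reduce(lambda left, right: left.merge(right, on=key), parts, base)


df = pd.DataFrame({
    "CustomerName": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "OrderSequence": [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    "Category": ["Food", "Food", "Clothes", "Furniture", "Clothes", "Food", "Toys",
                 "Clothes", "Toys", "Food", "Furniture", "Toys", "Food", "Food"],
})
result = order_combinations(df)
print(result)
```

Because every merge is an inner join, a customer that lacks any one of the sequence values drops out of the result entirely. Whether that is acceptable depends on the data.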
Admittedly, the pandas approach in this answer was first worked out in SQL using self joins and then translated to pandas:
SELECT t1.CustomerName, t2.Category AS Category1,
       t3.Category AS Category2, t4.Category AS Category3
FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1
INNER JOIN DataFrame AS t2
    ON t1.CustomerName = t2.CustomerName
INNER JOIN DataFrame AS t3
    ON t1.CustomerName = t3.CustomerName
INNER JOIN DataFrame AS t4
    ON t1.CustomerName = t4.CustomerName
WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
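To try the self-join query without a database server, one option is an in-memory SQLite database from Python's standard library. This is only a sketch: the table name `DataFrame` comes from the query above, the column types are assumed, and an `ORDER BY` has been added here so the row order is stable.

```python
import sqlite3

# Same sample data as the pandas example, as (CustomerName, OrderSequence, Category) rows.
rows = list(zip(
    [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    ["Food", "Food", "Clothes", "Furniture", "Clothes", "Food", "Toys",
     "Clothes", "Toys", "Food", "Furniture", "Toys", "Food", "Food"],
))

conn = sqlite3.connect(":memory:")
# Column types are an assumption; the original post does not give a schema.
conn.execute(
    "CREATE TABLE DataFrame (CustomerName INTEGER, OrderSequence INTEGER, Category TEXT)"
)
conn.executemany("INSERT INTO DataFrame VALUES (?, ?, ?)", rows)

query = """
SELECT t1.CustomerName, t2.Category AS Category1,
       t3.Category AS Category2, t4.Category AS Category3
FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1
INNER JOIN DataFrame AS t2 ON t1.CustomerName = t2.CustomerName
INNER JOIN DataFrame AS t3 ON t1.CustomerName = t3.CustomerName
INNER JOIN DataFrame AS t4 ON t1.CustomerName = t4.CustomerName
WHERE t2.OrderSequence = 1 AND t3.OrderSequence = 2 AND t4.OrderSequence = 3
ORDER BY t1.CustomerName  -- added for a stable row order; not in the original query
"""
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

On the sample data this returns the same twelve combinations as the pandas merge, nine for customer 1, one for customer 2, and two for customer 3.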
Answer 1 (score: 0)
OK. It took some work, but I got it done. Hope it helps.
from functools import reduce  # reduce is not a builtin in Python 3

import numpy as np
import pandas as pd

df = pd.DataFrame([], columns=['CustomerName','Order Sequence','Category'])
df['CustomerName'] = [1,1,1,1,1,1,1,2,2,2,3,3,3,3]
df['Order Sequence'] = [1,2,2,2,3,3,3,1,2,3,1,1,2,3]
df['Category'] = ['Food','Food','Clothes','Furniture','Clothes','Food','Toys','Clothes','Toys','Food','Furniture','Toys','Food','Food']

df2 = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])
for CN in sorted(set(df['CustomerName'])):
    df_temp = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])
    list_OS_1 = []
    list_OS_2 = []
    list_OS_3 = []
    # Product of the row counts per order sequence = number of output rows for this customer
    MMC = reduce(lambda x, y: x*y, df.loc[df['CustomerName']==CN, 'Order Sequence'].value_counts().values)
    # Repeat each sequence's categories until every list has MMC entries
    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category']:
            list_OS_1.append(CTG)
    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category']:
            list_OS_2.append(CTG)
    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category']:
            list_OS_3.append(CTG)
    df_temp['Category1'] = list_OS_1
    df_temp['Category2'] = list_OS_2
    df_temp['Category3'] = list_OS_3
    df_temp['CustomerName'] = CN
    df2 = pd.concat([df2, df_temp], axis=0)
print(df2)
Output:
   CustomerName  Category1  Category2  Category3
0           1.0       Food       Food    Clothes
1           1.0       Food    Clothes       Food
2           1.0       Food  Furniture       Toys
3           1.0       Food       Food    Clothes
4           1.0       Food    Clothes       Food
5           1.0       Food  Furniture       Toys
6           1.0       Food       Food    Clothes
7           1.0       Food    Clothes       Food
8           1.0       Food  Furniture       Toys
0           2.0    Clothes       Toys       Food
0           3.0  Furniture       Food       Food
1           3.0       Toys       Food       Food
PS: it isn't dynamic, so if you add or remove categories it will break. But as long as the data follows the criteria you gave initially, it will work.
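Note that in the output printed for this answer, rows 0, 3 and 6 for customer 1 are identical (as are 1, 4, 7 and 2, 5, 8), so the lockstep repetition yields only three distinct combinations for that customer rather than all nine. A possible dynamic alternative, sketched here with the standard library only, groups the categories by customer and order sequence and lets `itertools.product` enumerate the combinations. The variable names are illustrative, not taken from the answer.

```python
from collections import defaultdict
from itertools import product

customers = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
sequences = [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3]
categories = ["Food", "Food", "Clothes", "Furniture", "Clothes", "Food", "Toys",
              "Clothes", "Toys", "Food", "Furniture", "Toys", "Food", "Food"]

# Group categories by customer, then by order sequence, keeping input order.
grouped = defaultdict(lambda: defaultdict(list))
for customer, seq, category in zip(customers, sequences, categories):
    grouped[customer][seq].append(category)

rows = []
for customer, by_seq in grouped.items():
    # Cartesian product across the sequence levels, in ascending sequence order.
    levels = [by_seq[s] for s in sorted(by_seq)]
    for combo in product(*levels):
        rows.append((customer, *combo))

for row in rows:
    print(row)
```

When every customer has the same number of order sequences, as in the sample data, the resulting list of tuples can be passed to `pd.DataFrame(rows, columns=[...])` to get a frame back. Customers with differing numbers of sequences would produce tuples of different lengths and need separate handling.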