Creating category permutations from a Pandas DataFrame

Time: 2017-01-27 01:00:42

Tags: python pandas stack pivot

My DataFrame looks like this:

Current State

My goal is:

Final State

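Notes:

  1. Each customer has placed 3 orders.
  2. Each order can include purchases from multiple categories.
  3. Desired state: get all permutations of the categories a customer bought, arranged by order sequence (one category per order). The second image should make this clearer.
  4. In the desired state, Category 1 is a category bought in the first order, Category 2 is a category bought in the second order, and so on.
  5. The code I am using: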

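    import time
    import pandas as pd

    start_time = time.time()

    # base_df: the input frame with CustomerName, order_seq and Category columns
    frames = []
    for CustomerName in base_df.CustomerName.unique():
        df1 = base_df[base_df['CustomerName'] == CustomerName][['CustomerName', 'order_seq', 'Category']]
        # Cartesian product of the categories bought in each order
        df2 = pd.DataFrame(index=pd.MultiIndex.from_product(
            [subdf['Category'] for _, subdf in df1.groupby('order_seq')],
            names=sorted(df1.order_seq.unique()))).reset_index()
        df2['CustomerName'] = CustomerName
        frames.append(df2)  # DataFrame.append was removed in pandas 2.0
    df = pd.concat(frames)

    print("--- %s seconds ---" % (time.time() - start_time))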

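This takes about 10 minutes to run on my dataset, so I am looking for a faster approach.

I am working in Pandas for now, but pointers to R or SQL are also welcome. Thanks!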

2 Answers:

Answer 0 (score: 1)

Consider merging three DataFrames, one per OrderSequence value, each joined onto the distinct CustomerName values:

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})

finaldf = pd.DataFrame(df['CustomerName'].drop_duplicates())

for i in range(1,4):
    # Rows of the i-th order, with Category renamed to Category1/2/3
    seqdf = (df[df['OrderSequence'] == i][['CustomerName', 'Category']]
             .rename(columns={'Category': 'Category' + str(i)}))
    finaldf = pd.merge(finaldf, seqdf, on=['CustomerName'])

print(finaldf)

#     CustomerName  Category1  Category2 Category3
# 0              1       Food       Food   Clothes
# 1              1       Food       Food      Food
# 2              1       Food       Food      Toys
# 3              1       Food    Clothes   Clothes
# 4              1       Food    Clothes      Food
# 5              1       Food    Clothes      Toys
# 6              1       Food  Furniture   Clothes
# 7              1       Food  Furniture      Food
# 8              1       Food  Furniture      Toys
# 9              2    Clothes       Toys      Food
# 10             3  Furniture       Food      Food
# 11             3       Toys       Food      Food
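
The loop above hard-codes three orders. A minimal sketch of the same merge idea for however many OrderSequence values appear in the data is shown below; the helper name `category_combinations` is hypothetical, and the sample frame is the one from this answer. Because the merges are inner joins, a customer who is missing any order would drop out of the result.

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})


def category_combinations(frame):
    """Hypothetical helper: one inner merge per OrderSequence value."""
    seqs = sorted(frame['OrderSequence'].unique())
    # One frame per order, with Category renamed to Category<seq>
    parts = [
        frame.loc[frame['OrderSequence'] == s, ['CustomerName', 'Category']]
             .rename(columns={'Category': f'Category{s}'})
        for s in seqs
    ]
    base = frame[['CustomerName']].drop_duplicates()
    # Fold the per-order frames onto the distinct customers, left to right
    return reduce(lambda left, right: left.merge(right, on='CustomerName'), parts, base)


result = category_combinations(df)

# Sanity check: per customer, the row count should equal the product of the
# number of categories bought in each order
expected_rows = (df.groupby(['CustomerName', 'OrderSequence']).size()
                   .groupby(level='CustomerName').prod().sum())
print(result)
print(len(result), expected_rows)
```

The row-count check is a cheap way to confirm that no combination was lost or duplicated by the merges.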

Admittedly, I first worked out the merge approach in this answer as SQL self-joins and then translated it to pandas:

SELECT t1.CustomerName, t2.Category AS Category1, 
       t3.Category AS Category2, t4.Category AS Category3

FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1 
INNER JOIN DataFrame AS t2 
ON t1.CustomerName = t2.CustomerName 
INNER JOIN DataFrame AS t3
ON t1.CustomerName = t3.CustomerName 
INNER JOIN DataFrame AS t4
ON t1.CustomerName = t4.CustomerName

WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
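
One way to try the query without a database server is to load the sample frame into an in-memory SQLite database from Python. This is only a sketch under assumptions: the answer does not say which database it targets, SQLite is used here for convenience, and the table name `DataFrame` is taken from the query. Row order is not guaranteed without an ORDER BY clause.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})

query = """
SELECT t1.CustomerName, t2.Category AS Category1,
       t3.Category AS Category2, t4.Category AS Category3
FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1
INNER JOIN DataFrame AS t2 ON t1.CustomerName = t2.CustomerName
INNER JOIN DataFrame AS t3 ON t1.CustomerName = t3.CustomerName
INNER JOIN DataFrame AS t4 ON t1.CustomerName = t4.CustomerName
WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
"""

conn = sqlite3.connect(':memory:')
try:
    # Write the frame to a table named DataFrame (the name used in the query)
    df.to_sql('DataFrame', conn, index=False)
    sql_result = pd.read_sql_query(query, conn)
finally:
    conn.close()

print(sql_result)
```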

Answer 1 (score: 0)

OK. It took some work, but I got it. Hope it helps.

import pandas as pd
import numpy as np
from functools import reduce  # reduce is not a builtin in Python 3

df = pd.DataFrame([], columns=['CustomerName','Order Sequence','Category'])

df['CustomerName'] = [1,1,1,1,1,1,1,2,2,2,3,3,3,3]
df['Order Sequence'] = [1,2,2,2,3,3,3,1,2,3,1,1,2,3]
df['Category'] = ['Food','Food','Clothes','Furniture','Clothes','Food','Toys','Clothes','Toys','Food','Furniture','Toys','Food','Food']

df2 = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

for CN in sorted(set(df['CustomerName'])):

    df_temp = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

    list_OS_1 = []
    list_OS_2 = []
    list_OS_3 = []

    MMC = reduce(lambda x, y: x*y,df.loc[df['CustomerName']==CN, 'Order Sequence'].value_counts().values)

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category']:

            list_OS_1.append(CTG) 

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category']:

            list_OS_2.append(CTG) 

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category']:

            list_OS_3.append(CTG) 

    df_temp['Category1'] = list_OS_1
    df_temp['Category2'] = list_OS_2
    df_temp['Category3'] = list_OS_3
    df_temp['CustomerName'] = CN

    df2 = pd.concat([df2, df_temp], axis=0)  # axis must be passed by keyword in pandas 2.0+

print (df2)

Output:

   CustomerName  Category1  Category2 Category3
0           1.0       Food       Food   Clothes
1           1.0       Food    Clothes      Food
2           1.0       Food  Furniture      Toys
3           1.0       Food       Food   Clothes
4           1.0       Food    Clothes      Food
5           1.0       Food  Furniture      Toys
6           1.0       Food       Food   Clothes
7           1.0       Food    Clothes      Food
8           1.0       Food  Furniture      Toys
0           2.0    Clothes       Toys      Food
0           3.0  Furniture       Food      Food
1           3.0       Toys       Food      Food

PS: it is not dynamic, so if you add or remove categories it will break. But as long as the data follows the initial criteria you gave, it will work.
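
In the output shown, customer 1's first three rows repeat three times, whereas that customer has nine distinct combinations (1 x 3 x 3). A minimal sketch of a per-customer Cartesian product with `itertools.product` is given below as one possible alternative. It is not the answerer's method, and it assumes every customer has exactly three orders, as stated in the question, so that each row has three category columns.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({
    'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
    'Order Sequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
    'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                 'Clothes','Toys','Food','Furniture','Toys','Food','Food'],
})

rows = []
for customer, cust_df in df.groupby('CustomerName'):
    # One list of categories per order; groupby yields orders in sorted sequence
    per_order = [grp['Category'].tolist()
                 for _, grp in cust_df.groupby('Order Sequence')]
    # Every way of picking one category from each order
    for combo in product(*per_order):
        rows.append((customer, *combo))

# Assumption: exactly three orders per customer, hence three category columns
result = pd.DataFrame(rows, columns=['CustomerName', 'Category1', 'Category2', 'Category3'])
print(result)
```

Building plain tuples first and constructing the DataFrame once avoids repeated concatenation inside the loop.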