For-loop alternative in pandas (Python)

Time: 2016-02-01 01:37:22

Tags: python for-loop pandas

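Python and pandas newbie here! I am trying to transpose a data frame containing a million records using a for loop. As you can imagine, it is painfully slow. Please see my process and code below.

I am working with two data frames: transactions - contains the customer ID (column a in the sample below) and the category each purchase was made in.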

transactions=pandas.DataFrame({'a':['johnny','sally','maggy','lassy','johnny','sally','maggy'],
'category':['fruits','fruits','spices','veggies','veggies','spices','snacks']})

category_list - contains all the categories a customer can purchase from.

category_list=pandas.DataFrame({'category':['fruits','spices','veggies','snacks','drinks','alcohol','adult']})

For each customer, assign a value of 1 if the customer has (ever) made a purchase in a given category; otherwise assign 0.

Code:

cust_list = transactions['a'].unique()
final_data = pandas.DataFrame()

for i in cust_list:
    step1 = transactions[transactions.a == i]
    step1 = step1.drop_duplicates()
    step1['value'] = 1
    cat_merge = pandas.merge(step1, category_list, how='right', left_on='category', right_on='category')
    cat_merge['a'] = i
    cat_merge = cat_merge.fillna(0)
    cat_merge_transpose = pandas.DataFrame(cat_merge.transpose())
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.columns = cat_merge_transpose.iloc[0]
    cat_merge_transpose = cat_merge_transpose.drop(cat_merge_transpose.index[0])
    cat_merge_transpose.reset_index()
    cat_merge_transpose.insert(0, 'a', i)
    final_data = final_data.append(pandas.DataFrame(data = cat_merge_transpose), ignore_index=True)

So in this case, the result would look like this:

print final_data

Any help optimizing this so that it runs significantly faster would be much appreciated; less code would be very welcome too.

Thanks.

2 Answers:

Answer 0: (score: 2)

Your problem can be treated as a pivot operation, so we can use pivot_table (here df is the transactions data frame):

>>> df["value"] = 1
>>> P = df.pivot_table(index="a", columns="category", values="value", aggfunc=max)
>>> P.loc[:,category_list.category.unique()].fillna(0)
category  fruits  spices  veggies  snacks  drinks  alcohol  adult
a                                                                
johnny         1       0        1       0       0        0      0
lassy          0       0        1       0       0        0      0
maggy          0       1        0       1       0        0      0
sally          1       1        0       0       0        0      0

pivot_table by itself gives us

>>> P
category  fruits  snacks  spices  veggies
a                                        
johnny         1     NaN     NaN        1
lassy        NaN     NaN     NaN        1
maggy        NaN       1       1      NaN
sally          1     NaN       1      NaN

We then index it with all of the category columns (including those never seen in the data) and call fillna to replace NaN with 0.
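
A note for readers on recent pandas releases: selecting with .loc and a list containing labels that are missing from the columns now raises a KeyError, so the indexing step may fail as written. The following is a minimal sketch of the same pivot idea, assuming that reindex is an acceptable substitute for that step and that df stands for the transactions frame from the question. The name result and the final astype(int) are additions for illustration, not part of the original answer.

```python
import pandas as pd

transactions = pd.DataFrame({
    'a': ['johnny', 'sally', 'maggy', 'lassy', 'johnny', 'sally', 'maggy'],
    'category': ['fruits', 'fruits', 'spices', 'veggies', 'veggies', 'spices', 'snacks'],
})
category_list = pd.DataFrame({
    'category': ['fruits', 'spices', 'veggies', 'snacks', 'drinks', 'alcohol', 'adult'],
})

# Add the marker column on a copy so the original frame is left untouched.
df = transactions.assign(value=1)

# One row per customer, one column per category that was actually purchased.
P = df.pivot_table(index='a', columns='category', values='value', aggfunc='max')

# reindex (assumed substitute for the .loc step) adds never-purchased
# categories as all-NaN columns instead of raising KeyError;
# fillna and astype then turn the table into 0/1 integers.
result = (
    P.reindex(columns=category_list['category'].unique())
     .fillna(0)
     .astype(int)
)
print(result)
```

Because reindex follows the order of category_list, the columns come out in the same order as the master list rather than alphabetically.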

Answer 1: (score: 1)

# Get a unique list of all category items.
categories = category_list.category.unique().tolist()

# For transactions with a given customer matching any category, assign a value of one.
transactions['value'] = transactions.groupby('a').category.transform(
                            lambda s: s.isin(categories).any()).astype(int)
output = transactions.groupby(['a', 'category']).max().unstack().fillna(0)
output.columns = output.columns.droplevel()
zero_cols = [c for c in categories if c not in output]
for col in zero_cols:
    output[col] = 0
>>> output
category  fruits  snacks  spices  veggies  drinks  alcohol  adult
a                                                                
johnny         1       0       0        1       0        0      0
lassy          0       0       0        1       0        0      0
maggy          0       1       1        0       0        0      0
sally          1       0       1        0       0        0      0
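
Note that the columns above follow the unstacked order with the zero-filled categories appended, which differs from the order in category_list. As a possible shorter variant of the same groupby-and-unstack idea, here is a sketch that assumes pandas.crosstab is acceptable for counting purchases. The names counts and flags are hypothetical and do not appear in the answer above.

```python
import pandas as pd

transactions = pd.DataFrame({
    'a': ['johnny', 'sally', 'maggy', 'lassy', 'johnny', 'sally', 'maggy'],
    'category': ['fruits', 'fruits', 'spices', 'veggies', 'veggies', 'spices', 'snacks'],
})
categories = ['fruits', 'spices', 'veggies', 'snacks', 'drinks', 'alcohol', 'adult']

# crosstab counts how many times each customer bought from each category.
counts = pd.crosstab(transactions['a'], transactions['category'])

# clip(upper=1) caps the counts at 1 so repeat purchases still give a 0/1 flag;
# reindex adds the never-purchased categories as zero-filled columns,
# in the order of the master category list.
flags = counts.clip(upper=1).reindex(columns=categories, fill_value=0)
print(flags)
```

Like the groupby version, this avoids a Python-level loop over customers, which is the main cost in the original code.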