Python: operations on a pandas DataFrame

Time: 2014-09-19 07:41:06

Tags: python pandas bigdata dataframe

I have a pandas DataFrame called joined with 5 fields:

product | price | percentil_25 | percentil_50 | percentil_75

For each row, I want to classify the price as follows:

If the price is at or below percentil_25, I give the product class 1, and so on.

So what I did is:

from collections import OrderedDict

classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []

for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if(float(joined.values[index][1]) <= float(joined.values[index][2])):
        classe_final['class'].append(1)
    elif(float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3])):
        classe_final['class'].append(2)
    elif(float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4])):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)

But since my DataFrame is very large, this takes forever.

Do you know how I could do this faster?

2 answers:

Answer 0: (score: 0)

# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()

There are two ways. The first is to define a function and call apply:

def classify(x):
    # "class" is a reserved word in Python, so the function needs another name
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4

df['class'] = other_df.apply(classify, axis=1)
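
To make the apply approach easier to try, here is a minimal self-contained sketch. The sample values are made up for illustration, and it assumes the columns live directly in joined, as in the question:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d"],
    "price": [5.0, 15.0, 25.0, 35.0],
    "percentil_25": [10.0, 10.0, 10.0, 10.0],
    "percentil_50": [20.0, 20.0, 20.0, 20.0],
    "percentil_75": [30.0, 30.0, 30.0, 30.0],
})

def classify(row):
    """Return the class (1 to 4) of one row, using the same boundaries as above."""
    if row["price"] < row["percentil_25"]:
        return 1
    if row["price"] < row["percentil_50"]:
        return 2
    if row["price"] < row["percentil_75"]:
        return 3
    return 4

# axis=1 passes each row to classify as a Series; this is still a Python-level loop.
apply_classes = joined.apply(classify, axis=1)
print(apply_classes.tolist())
```

Because apply with axis=1 calls the function once per row, it tends to be tidier than the original loop but not dramatically faster on a very large frame.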
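The other way, which I think is better and much faster, is to add a 'class' column to your existing df using loc with boolean masks, and then select only the 2 columns of interest: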

joined.loc[joined['price'] < joined['percentil_25'], 'class'] =1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] =2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] =3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] =4

classe_final = joined[['product', 'class']]
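
As a rough check of the loc approach, the following sketch runs the same four masks on a tiny made-up frame that includes prices sitting exactly on a boundary. The data is hypothetical; the cast to int at the end is my addition, since filling a new column through loc leaves it as float:

```python
import pandas as pd

# Hypothetical sample data; prices 10, 20 and 30 fall exactly on a percentile.
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d", "e"],
    "price": [5.0, 10.0, 20.0, 29.0, 30.0],
    "percentil_25": [10.0] * 5,
    "percentil_50": [20.0] * 5,
    "percentil_75": [30.0] * 5,
})

price = joined["price"]
p25, p50, p75 = joined["percentil_25"], joined["percentil_50"], joined["percentil_75"]

# Each assignment touches only the rows selected by its boolean mask.
joined.loc[price < p25, "class"] = 1
joined.loc[(price >= p25) & (price < p50), "class"] = 2
joined.loc[(price >= p50) & (price < p75), "class"] = 3
joined.loc[price >= p75, "class"] = 4

# The column was created as float (NaN until filled), so cast once every row is set.
joined["class"] = joined["class"].astype(int)
classe_final = joined[["product", "class"]]
print(classe_final)
```

Note that these masks put a price equal to a percentile into the higher class, whereas the loop in the question puts it into the lower one, so swap the strict and non-strict comparisons if that matters for your data.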

Just for kicks, you could also use nested np.where conditions:

classe_final['class'] = np.where(
    joined['price'] > joined['percentil_75'], 4,
    np.where(
        joined['price'] > joined['percentil_50'], 3,
        np.where(joined['price'] > joined['percentil_25'], 2, 1)
    )
)

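This evaluates whether the price is greater than percentil_75; if so the class is 4, otherwise it evaluates the next condition, and so on. It may be worth timing this against loc, but it is a lot less readable.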
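
If readability is the main objection, one possible alternative (not part of the answer above) is np.select, which takes the conditions as a flat list and uses the first one that matches. The sketch below uses made-up data and checks that it agrees with the nested np.where:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d", "e"],
    "price": [5.0, 10.0, 15.0, 25.0, 35.0],
    "percentil_25": [10.0] * 5,
    "percentil_50": [20.0] * 5,
    "percentil_75": [30.0] * 5,
})

price = joined["price"].to_numpy()
p25 = joined["percentil_25"].to_numpy()
p50 = joined["percentil_50"].to_numpy()
p75 = joined["percentil_75"].to_numpy()

# Nested form, as in the answer above.
nested = np.where(price > p75, 4, np.where(price > p50, 3, np.where(price > p25, 2, 1)))

# Flat form: conditions are checked in order and the first match wins.
flat = np.select([price > p75, price > p50, price > p25], [4, 3, 2], default=1)

print(nested.tolist())
print(flat.tolist())
```

With strict > comparisons a price equal to a percentile stays in the lower class, which matches the boundaries of the loop in the question.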

Answer 1: (score: 0)

Another solution. If someone asked me to bet on which one is the fastest, I would pick this one:

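# Each comparison contributes 0 or 1, so the sum alone runs from 0 to 3;
# the leading "1 +" shifts it onto the classes 1 to 4.
joined.set_index("product").eval(
    "1 + 1 * (price >= percentil_25)"
    "  + (price >= percentil_50)"
    "  + (price >= percentil_75)"
)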
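
The idea here is that each comparison yields True or False, and counting how many percentiles the price has reached gives the class directly. Here is a sketch of the same idea written with plain pandas operations instead of eval, on made-up data, adding 1 so that the classes run from 1 to 4:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d", "e"],
    "price": [5.0, 10.0, 20.0, 29.0, 30.0],
    "percentil_25": [10.0] * 5,
    "percentil_50": [20.0] * 5,
    "percentil_75": [30.0] * 5,
})

by_product = joined.set_index("product")
price = by_product["price"]

# Count how many of the three percentiles the price has reached, then add 1.
classes = (
    1
    + (price >= by_product["percentil_25"]).astype(int)
    + (price >= by_product["percentil_50"]).astype(int)
    + (price >= by_product["percentil_75"]).astype(int)
)
print(classes)
```

The result is a Series indexed by product. This relies on the percentiles being ordered within each row, which should hold for any real percentil_25, percentil_50 and percentil_75.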