I have a pandas DataFrame named joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentil_75
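For each row, I want to classify the price as follows:
if the price is at or below percentil_25, I give the product class 1, and so on for the other percentiles.
So what I did was: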
from collections import OrderedDict

classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []
for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if(float(joined.values[index][1]) <= float(joined.values[index][2])):
        classe_final['class'].append(1)
    elif(float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3])):
        classe_final['class'].append(2)
    elif(float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4])):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)
But since my DataFrame is very large, this takes forever.
Do you know how I could do this faster?
Answer 0 (score: 0)
import pandas as pd

# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
There are 2 ways. The first is to define a function and call apply:
# "class" is a reserved word in Python, so the function is named classify
def classify(x):
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4

df['class'] = other_df.apply(lambda row: classify(row), axis=1)
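The other way, which I think is better and much faster, is to add the 'class' column to your existing df using loc, and then select only the 2 columns of interest: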
joined.loc[joined['price'] < joined['percentil_25'], 'class'] = 1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] = 2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] = 3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] = 4
# the question names the identifier column 'product'
classe_final = joined[['product', 'class']]
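To check the "much faster" claim on your own machine, something like the following rough sketch could be used. The synthetic data, the row count, and the helper names with_apply and with_loc are assumptions made up for illustration; timings will vary with data size and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: sizes and value ranges are assumptions for illustration.
rng = np.random.default_rng(0)
n = 20_000
joined = pd.DataFrame({
    "product": np.arange(n),
    "price": rng.uniform(0, 100, n),
    "percentil_25": np.full(n, 25.0),
    "percentil_50": np.full(n, 50.0),
    "percentil_75": np.full(n, 75.0),
})


def classify(row):
    """Row-wise classification, as in the apply approach."""
    if row.price < row.percentil_25:
        return 1
    elif row.price < row.percentil_50:
        return 2
    elif row.price < row.percentil_75:
        return 3
    return 4


def with_apply(df):
    # Calls classify once per row in Python.
    return df.apply(classify, axis=1)


def with_loc(df):
    # Vectorized boolean masks, as in the loc approach; works on a copy.
    df = df.copy()
    p = df["price"]
    df.loc[p < df["percentil_25"], "class"] = 1
    df.loc[(p >= df["percentil_25"]) & (p < df["percentil_50"]), "class"] = 2
    df.loc[(p >= df["percentil_50"]) & (p < df["percentil_75"]), "class"] = 3
    df.loc[p >= df["percentil_75"], "class"] = 4
    return df["class"].astype(int)


# Time a single run of each approach and print the elapsed seconds.
t_apply = timeit.timeit(lambda: with_apply(joined), number=1)
t_loc = timeit.timeit(lambda: with_loc(joined), number=1)
print(f"apply: {t_apply:.4f}s  loc: {t_loc:.4f}s")
```

Both helpers should return the same classes, so the comparison is only about speed.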
Just for kicks, you could also use np.where to assign the classes conditionally:
classe_final['class'] = np.where(
    joined['price'] > joined['percentil_75'], 4,
    np.where(
        joined['price'] > joined['percentil_50'], 3,
        np.where(joined['price'] > joined['percentil_25'], 2, 1)
    )
)
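This checks whether price is greater than percentil_75; if so, the class is 4, otherwise it evaluates the next condition, and so on. It may be worth timing against loc, but it is a lot less readable.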
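If readability is the main concern with nested np.where, np.select is one possible alternative: it takes a list of conditions and a matching list of values, and the first condition that holds wins. This is a minimal sketch, not part of the original answer; the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: values are made up for illustration.
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d"],
    "price": [5.0, 30.0, 60.0, 90.0],
    "percentil_25": [25.0] * 4,
    "percentil_50": [50.0] * 4,
    "percentil_75": [75.0] * 4,
})

# Conditions are checked in order, so the highest threshold goes first.
conditions = [
    joined["price"] > joined["percentil_75"],
    joined["price"] > joined["percentil_50"],
    joined["price"] > joined["percentil_25"],
]
choices = [4, 3, 2]

# Rows matching none of the conditions fall back to class 1.
joined["class"] = np.select(conditions, choices, default=1)
print(joined[["product", "class"]])
```

The comparisons use the same strict > as the np.where version, so a price exactly equal to a percentile lands in the lower class.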
Answer 1 (score: 0)
Another solution. If someone asked me to bet on which one is the fastest, I would pick this one:
joined.set_index("product").eval(
    "1 * (price >= percentil_25)"
    " + (price >= percentil_50)"
    " + (price >= percentil_75)"
)
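Note that the eval expression above sums three comparisons, so as written it appears to yield values from 0 to 3 rather than 1 to 4. A sketch of the same idea in plain pandas, shifted by 1 to match the class labels used in the question, might look like this; the toy data is made up for illustration.

```python
import pandas as pd

# Toy data: values are made up for illustration.
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d", "e"],
    "price": [5.0, 25.0, 50.0, 75.0, 90.0],
    "percentil_25": [25.0] * 5,
    "percentil_50": [50.0] * 5,
    "percentil_75": [75.0] * 5,
})

indexed = joined.set_index("product")

# Each comparison contributes 0 or 1; starting from 1 gives classes 1 to 4.
classes = (
    1
    + (indexed["price"] >= indexed["percentil_25"]).astype(int)
    + (indexed["price"] >= indexed["percentil_50"]).astype(int)
    + (indexed["price"] >= indexed["percentil_75"]).astype(int)
)
print(classes)
```

Because the comparisons are >=, a price exactly equal to a percentile is placed in the higher class, which matches the loc version in Answer 0 but not the <= boundaries in the question.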