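I am trying to convert all the values in a dataset to categorical values. I want every numeric value to be classified as low, average, or high according to its quantile.
So if a value is below the 25% quantile of its series, it should be converted to "Low".
I tried using assign and then apply with a function I wrote: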
def turn_into_categorical(row):
    quantile_level = [.25, .5, .75]
    for r in row:
        cut = refugees_T_F_V_P_full_data.r.quantile(quantile_level)
        if r >= cut[.75]:
            return "High"
        elif r >= cut[.25] and r < cut[0.75]:
            return "Average"
        else:
            return "Low"
refugees_T_F_V_P_full_data.apply(turn_into_categorical, axis = 1)
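However, the code does not work correctly. I also tried iterating over the values, but I am wondering whether there is a faster way?
This is the data I want to convert. Every number except Year and Month should be classified as low, medium, or high according to its quantile.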
    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014     10                6.0                1.0      0.0      3.0
1   2014     11                4.0                3.0      0.0     12.0
2   2014     12                3.0                5.0      0.0     11.0
3   2015      1                7.0                2.0      0.0      4.0
4   2015      2                5.0                5.0      0.0     10.0
5   2015      3                7.0                5.0      0.0      8.0
6   2015      4                4.0                1.0      0.0      6.0
7   2015      5                5.0                0.0      0.0      7.0
8   2015      6                4.0                1.0      0.0      6.0
9   2015      7               15.0                2.0      0.0      9.0
10  2015      8               10.0                7.0      0.0      9.0
11  2015      9               12.0                0.0      0.0      8.0
12  2015     10               12.0                0.0      0.0      5.0
13  2015     11                8.0                5.0      0.0     10.0
14  2015     12                5.0                7.0      0.0      3.0
Expected result (example):
    Year  Month  Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0   2014     10               High             Medium      Low   Medium
1   2014     11                Low             Medium      Low     High
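Answer 0 (score: 2)
It looks like you want pd.qcut, which does exactly this. From the docs:
Quantile-based discretization function.
So you can apply pd.qcut along the columns of the dataframe, starting from Central Equatoria, and pass q = [0, 0.25, 0.75, 1.0] as the quantiles to use when binning each series: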
df.loc[:, 'Central Equatoria':].apply(
    lambda x: pd.qcut(x, q=[0, 0.25, 0.75, 1.0], labels=['low', 'medium', 'high'])
    if not x.nunique() == 1 else 'low')
Output:
    Central Equatoria  Eastern Equatoria  Gogrial  Jonglei
0              medium                low      low      low
1                 low             medium      low     high
2                 low             medium      low     high
3              medium             medium      low      low
4              medium             medium      low     high
5              medium             medium      low   medium
6                 low                low      low   medium
7              medium                low      low   medium
8                 low                low      low   medium
9                high             medium      low   medium
10               high               high      low   medium
11               high                low      low   medium
12               high                low      low      low
13             medium             medium      low     high
14             medium               high      low      low
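For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the question's sample data (Year and Month are left out), and the helper name qcut_column is my own, not something pandas provides. Converting the result to plain strings is a choice made here so that the constant column and the binned columns end up with the same dtype.

```python
import pandas as pd

# Sample data copied from the question (Year and Month omitted for brevity).
df = pd.DataFrame({
    "Central Equatoria": [6, 4, 3, 7, 5, 7, 4, 5, 4, 15, 10, 12, 12, 8, 5],
    "Eastern Equatoria": [1, 3, 5, 2, 5, 5, 1, 0, 1, 2, 7, 0, 0, 5, 7],
    "Gogrial": [0] * 15,
    "Jonglei": [3, 12, 11, 4, 10, 8, 6, 7, 6, 9, 9, 8, 5, 10, 3],
}, dtype=float)


def qcut_column(col):
    """Hypothetical helper: label one column as low / medium / high."""
    # qcut cannot build distinct bins when every value is identical,
    # so a constant column falls back to "low" (same rule as the answer).
    if col.nunique() == 1:
        return pd.Series("low", index=col.index)
    binned = pd.qcut(col, q=[0, 0.25, 0.75, 1.0],
                     labels=["low", "medium", "high"])
    # Plain strings keep every column the same dtype.
    return binned.astype(str)


labelled = df.apply(qcut_column)
print(labelled.head())
```

One thing to be aware of: the bins produced by pd.qcut are closed on the right, so a value exactly equal to the 25% quantile is labelled low, whereas the comparison in the question (r >= cut[.25]) would call it Average.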
Answer 1 (score: 1)
One idea, combining pd.DataFrame.quantile with pd.cut:
cats = ['Low', 'Medium', 'High']
quantiles = df.iloc[:, 2:].quantile([0, 0.25, 0.75, 1.0])
for col in df.iloc[:, 2:]:
    bin_edges = quantiles[col]
    # special case situations where all values are equal
    if bin_edges.nunique() == 1:
        df[col] = 'Low'
    else:
        df[col] = pd.cut(df[col], bins=bin_edges, labels=cats, include_lowest=True)
Result:
print(df)
    Year  Month  CentralEquatoria  EasternEquatoria  Gogrial  Jonglei
0   2014     10            Medium               Low      Low      Low
1   2014     11               Low            Medium      Low     High
2   2014     12               Low            Medium      Low     High
3   2015      1            Medium            Medium      Low      Low
4   2015      2            Medium            Medium      Low     High
5   2015      3            Medium            Medium      Low   Medium
6   2015      4               Low               Low      Low   Medium
7   2015      5            Medium               Low      Low   Medium
8   2015      6               Low               Low      Low   Medium
9   2015      7              High            Medium      Low   Medium
10  2015      8              High              High      Low   Medium
11  2015      9              High               Low      Low   Medium
12  2015     10              High               Low      Low      Low
13  2015     11            Medium            Medium      Low     High
14  2015     12            Medium              High      Low      Low
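To make the intermediate step easier to see, the sketch below prints the quantile table for two of the question's columns and shows why the constant column needs the special case. The names sample, edges and gogrial_is_constant are made up for this illustration.

```python
import pandas as pd

# Two columns from the question's data; the frame name `sample` is made up here.
sample = pd.DataFrame({
    "Jonglei": [3, 12, 11, 4, 10, 8, 6, 7, 6, 9, 9, 8, 5, 10, 3],
    "Gogrial": [0] * 15,
}, dtype=float)

# One row per requested quantile, one column per region.
edges = sample.quantile([0, 0.25, 0.75, 1.0])
print(edges)

# Jonglei has four distinct edges, so pd.cut can build three bins from them.
jonglei = pd.cut(sample["Jonglei"], bins=edges["Jonglei"].tolist(),
                 labels=["Low", "Medium", "High"], include_lowest=True)

# Gogrial is constant: all four edges collapse to 0.0, and pd.cut rejects
# duplicate bin edges, hence the nunique() check in the loop.
gogrial_is_constant = edges["Gogrial"].nunique() == 1
```

Note that the nunique() check only catches fully constant columns. A column where just two of the four edges coincide would still make pd.cut raise, so that case would need its own handling.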
Answer 2 (score: 0)
Using pd.cut() and df.apply():
df.iloc[:, 2:] = df.iloc[:, 2:].apply(lambda x: pd.cut(x, 3, labels=['Low', 'Med', 'High']), axis=1)
    Year  Month  Central_Equatoria  Eastern_Equatoria  Gogrial  Jonglei
0   2014     10               High                Low      Low      Med
1   2014     11                Low                Low      Low     High
2   2014     12                Low                Med      Low     High
3   2015      1               High                Low      Low      Med
4   2015      2                Med                Med      Low     High
5   2015      3               High                Med      Low     High
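As far as I can tell, this differs from the quantile-based answers in two ways: axis=1 bins each row rather than each column, and passing an integer to pd.cut splits the value range into equal-width bins instead of placing the edges at quantiles. The sketch below only illustrates the second point on the Jonglei column from the question; the variable names are my own.

```python
import pandas as pd

# Jonglei column from the question's data.
jonglei = pd.Series([3, 12, 11, 4, 10, 8, 6, 7, 6, 9, 9, 8, 5, 10, 3], dtype=float)
labels = ["Low", "Med", "High"]

# pd.cut with an integer splits the value range (3 to 12) into equal-width bins.
equal_width = pd.cut(jonglei, 3, labels=labels)

# pd.qcut places the edges at quantiles of the data instead.
by_quantile = pd.qcut(jonglei, q=[0, 0.25, 0.75, 1.0], labels=labels)

# Side-by-side view; rows where the two columns differ show the effect.
print(pd.DataFrame({"value": jonglei, "cut": equal_width, "qcut": by_quantile}))
```

In this column the value 6.0 lands in Low with equal-width bins but in Med with quantile bins, so the two approaches are not interchangeable.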
Answer 3 (score: 0)
In the end I went with the old-fashioned way:
new_df = pd.DataFrame()
name_list = list(df)
for name in name_list:
    if name != 'Year' and name != 'Month':
        new_row = []
        quantiles = df[name].quantile([.25, .5, .75])
        row_list = df[name].tolist()
        for i, value in enumerate(row_list):
            if value < quantiles[.25]:
                new_row.append("Low")
            elif value < quantiles[.75] and value >= quantiles[.25]:
                new_row.append("Average")
            else:
                new_row.append("High")
        series = pd.Series(new_row)
        new_df[name] = series.values
new_df.head()
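If the inner loop turns out to be slow, the same thresholds can probably be applied without iterating over values by using numpy.select. This is only a sketch: label_by_quantile is a hypothetical helper name, and the small frame is a cut-down version of the question's data.

```python
import numpy as np
import pandas as pd


def label_by_quantile(col):
    """Label one numeric column with the same thresholds as the loop above."""
    q25, q75 = col.quantile([0.25, 0.75])
    # Conditions are checked in order and the first match wins,
    # so the second one effectively means q25 <= value < q75.
    conditions = [col < q25, col < q75]
    labels = np.select(conditions, ["Low", "Average"], default="High")
    return pd.Series(labels, index=col.index)


# Cut-down sample in the shape of the question's data.
df = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015],
    "Month": [10, 11, 12, 1],
    "Jonglei": [3.0, 12.0, 11.0, 4.0],
    "Gogrial": [0.0, 0.0, 0.0, 0.0],
})
new_df = df.drop(columns=["Year", "Month"]).apply(label_by_quantile)
print(new_df)
```

With these comparisons a constant column such as Gogrial comes out as "High" everywhere, just as it does in the loop, because no value is strictly below either quantile.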