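I want to use pd.cut on some variables of my pandas DataFrame
(to turn continuous variables into discrete ones), but I want the bin ranges to depend on another column. Say I want 3 bins.
For example: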
+------+------+------+--------+
| col1 | col2 | col3 | sector |
+------+------+------+--------+
| 4.5 | 6 | 7 | a |
+------+------+------+--------+
| 8 | 9 | 17 | a |
+------+------+------+--------+
| 0 | 9 | 8 | b |
+------+------+------+--------+
| 8 | 9 | 0 | b |
+------+------+------+--------+
| 1 | 2 | 3.5 | b |
+------+------+------+--------+
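For anyone who wants to reproduce this, here is a minimal sketch that builds the table above as a DataFrame. The values are copied from the table; the variable name `df` is an assumption, chosen to match the name used in the answers.

```python
import pandas as pd

# Example data copied from the table above.
# The name `df` is an assumption; it matches what the answers use.
df = pd.DataFrame({
    'col1': [4.5, 8, 0, 8, 1],
    'col2': [6, 9, 9, 9, 2],
    'col3': [7, 17, 8, 0, 3.5],
    'sector': ['a', 'a', 'b', 'b', 'b'],
})
```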
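I only want to cut col1 and col2 into 3 bins based on sector, so the cut is performed separately for each sector. This is very useful for comparing variables that come from different sources.
The result would be (it is made up, so don't expect it to be 100% accurate):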
+----------+----------+------+--------+
| col1_cut | col2_cut | col3 | sector |
+----------+----------+------+--------+
| 2 | 2 | 7 | a |
+----------+----------+------+--------+
| 3 | 3 | 17 | a |
+----------+----------+------+--------+
| 1 | 3 | 8 | b |
+----------+----------+------+--------+
| 3 | 3 | 0 | b |
+----------+----------+------+--------+
| 1 | 1 | 3.5 | b |
+----------+----------+------+--------+
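PS: I am posting this Q&A because I ran into this problem and could not find a solution myself. Feel free to answer with your own solution or to improve mine; I appreciate your feedback.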
Answer 0 (score: 2)
I think it can be shortened to
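# Cut col1 and col2 into 3 equal-width bins within each sector,
# then concatenate the per-sector pieces back together
s = pd.concat([y[['col1', 'col2']].apply(pd.cut, bins=3, labels=False)
               for _, y in df.groupby('sector')])
s
Out[157]:
   col1  col2
0     0     0
1     2     2
2     0     2
3     2     2
4     0     0
# Overwrite col1 and col2 in df with the bin labels (aligned on the index)
df.update(s)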
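A variation on the same idea, offered as a sketch rather than as part of the answer above: `groupby(...).transform` can apply `pd.cut` within each sector and return the labels already aligned to the original index, so no `pd.concat` or `df.update` is needed. The `_cut` suffix follows the column names in the question; `cut_cols` and `result` are names made up for this example.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    'col1': [4.5, 8, 0, 8, 1],
    'col2': [6, 9, 9, 9, 2],
    'col3': [7, 17, 8, 0, 3.5],
    'sector': ['a', 'a', 'b', 'b', 'b'],
})

cut_cols = ['col1', 'col2']  # hypothetical name: the columns to discretize
for col in cut_cols:
    # transform runs pd.cut once per sector and keeps the original row order
    df[col + '_cut'] = df.groupby('sector')[col].transform(
        lambda s: pd.cut(s, bins=3, labels=False)
    )

# Drop the original continuous columns, keeping only the binned versions
result = df.drop(columns=cut_cols)
print(result)
```

The labels are 0-based, as in the output above; add 1 if you want bins numbered 1 to 3 as in the question's table.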
Answer 1 (score: 1)
To do this, you can simply:
col_add = []
sectors = df['sector'].unique()
for col in df.columns:
    if col in ['col1', 'col2']:
        col_add.append(col)
        cut_col = '{}_cut'.format(col)
        df[cut_col] = 0  # Initialized (not needed but I like to)
        for sector in sectors:
            mask = df['sector'] == sector
            # Assign through .loc; chained indexing may write to a copy instead of df
            df.loc[mask, cut_col] = pd.cut(df.loc[mask, col], 3, labels=False)
df.drop(col_add, axis=1, inplace=True)  # Remove old cols