My original dataset looks like the example below:
| id | old_a | new_a | old_b | new_b | ratio_a | ratio_b |
|----|-------|-------|-------|-------|----------|---------|
| 1 | 350 | 6 | 35 | 0 | 58.33333 | Inf |
| 2 | 164 | 79 | 6 | 2 | 2.075949 | 3 |
| 3 | 10 | 0 | 1 | 1 | Inf | 1 |
| 4 | 120 | 1 | 10 | 0 | 120 | Inf |
Here is the DataFrame:
import pandas as pd
data = [[1,350,6,35,0],[2,164,79,6,2],[3,10,0,1,1],[4,120,1,10,0]]
df = pd.DataFrame(data, columns=['id','old_a','new_a','old_b','new_b'])
I obtained the columns ratio_a and ratio_b (as shown in the table) with the following code:
df['ratio_a']= df['old_a']/df['new_a']
df['ratio_b']= df['old_b']/df['new_b']
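For anyone wondering where the Inf entries in the table come from: pandas does not raise on division by zero for numeric columns, so a positive value divided by 0 becomes inf. A minimal sketch on the same sample rows (the names `demo`, `inf_a` and `inf_b` are only illustrative):

```python
import numpy as np
import pandas as pd

# Same sample rows as in the question; `demo` is just an illustrative name.
demo = pd.DataFrame(
    [[1, 350, 6, 35, 0], [2, 164, 79, 6, 2], [3, 10, 0, 1, 1], [4, 120, 1, 10, 0]],
    columns=['id', 'old_a', 'new_a', 'old_b', 'new_b'],
)

# Element-wise division: a positive number divided by 0 yields inf, not an error.
demo['ratio_a'] = demo['old_a'] / demo['new_a']
demo['ratio_b'] = demo['old_b'] / demo['new_b']

# Boolean masks marking the rows whose ratio is infinite.
inf_a = np.isinf(demo['ratio_a'])
inf_b = np.isinf(demo['ratio_b'])
```

Here `inf_a` flags only id 3 and `inf_b` flags ids 1 and 4, matching the Inf cells in the table.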
Next, I want to create two more columns holding the numeric range that each ratio_a and ratio_b value falls into. For that I wrote the following code:
bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
df['a_range'] = pd.cut(df['ratio_a'], bins=bins, labels=labels, include_lowest=True)
df['b_range'] = pd.cut(df['ratio_b'], bins=bins, labels=labels, include_lowest=True)
The problem I have is that any value in ratio_a or ratio_b greater than 100 should fall into a '>100' bucket. How can I do that? My final result should look like this:
| id | old_a | new_a | old_b | new_b | ratio_a | ratio_b | a_range | b_range |
|----|-------|-------|-------|-------|----------|---------|---------|---------|
| 1 | 350 | 6 | 35 | 0 | 58.33333 | Inf | 50-60 | NaN |
| 2 | 164 | 79 | 6 | 2 | 2.075949 | 3 | 0-10 | 0-10 |
| 3 | 10 | 0 | 1 | 1 | Inf | 1 | NaN | 0-10 |
| 4 | 120 | 1 | 10 | 0 | 120 | Inf | >100 | NaN |
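To make the problem concrete, here is a small sketch of what the bins from the question do with a value above the last edge. It uses only the finite ratios from the table; the names `ratios`, `cut` and `out_of_range` are my own for illustration:

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]

# Finite ratios taken from the table (the Inf rows are left out here).
ratios = pd.Series([58.33333, 2.075949, 120.0])

cut = pd.cut(ratios, bins=bins, labels=labels, include_lowest=True)

# 120 lies beyond the last edge (100), so it gets no label and comes back as NaN.
out_of_range = cut.isna()
```

The first two values get '50-60' and '0-10', while 120 is left unlabeled, which is exactly the gap the '>100' bucket is meant to fill.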
Answer 0 (score: 1)
One possible solution:
import numpy as np
# Append np.inf as the last edge so everything above 100 lands in one open-ended bin
bins = [0,10,20,30,40,50,60,70,80,90,100,np.inf]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
labels[-1] = ">100"  # rename the auto-generated '100-inf' label
df['a_range'] = pd.cut(df['ratio_a'], bins=bins, labels=labels, include_lowest=True)
df['b_range'] = pd.cut(df['ratio_b'], bins=bins, labels=labels, include_lowest=True)
Result:
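| id | old_a | new_a | old_b | new_b | ratio_a | ratio_b | a_range | b_range |
|----|-------|-------|-------|-------|---------|---------|---------|---------|
| 1 | 350 | 6 | 35 | 0 | 58.333333 | inf | 50-60 | NaN |
| 2 | 164 | 79 | 6 | 2 | 2.075949 | 3.0 | 0-10 | 0-10 |
| 3 | 10 | 0 | 1 | 1 | inf | 1.0 | NaN | 0-10 |
| 4 | 120 | 1 | 10 | 0 | 120.000000 | inf | >100 | NaN |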
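If the same bucketing is needed for several columns, the idea can be wrapped in a small function. This is only a sketch under my own assumptions: `bucket_ratio`, `step` and `top` are hypothetical names, not part of pandas. Since the last edge is itself np.inf, the sketch does not rely on how pd.cut treats an infinite ratio; it masks those values to NaN first, which matches the result shown above.

```python
import numpy as np
import pandas as pd

def bucket_ratio(ratio, step=10, top=100):
    """Label ratios as '0-10', ..., '90-100' or '>100' (hypothetical helper)."""
    # Edges 0, 10, ..., top, plus an open-ended final edge.
    edges = list(range(0, top + step, step)) + [np.inf]
    # One label per closed range, then a single label for the open-ended bin.
    labels = ['{}-{}'.format(lo, hi) for lo, hi in zip(edges[:-2], edges[1:-1])]
    labels.append('>{}'.format(top))
    # Assumption: infinite ratios (division by zero) should stay unlabeled,
    # so they are turned into NaN before binning.
    finite = ratio.where(~np.isinf(ratio))
    return pd.cut(finite, bins=edges, labels=labels, include_lowest=True)

df = pd.DataFrame(
    [[1, 350, 6, 35, 0], [2, 164, 79, 6, 2], [3, 10, 0, 1, 1], [4, 120, 1, 10, 0]],
    columns=['id', 'old_a', 'new_a', 'old_b', 'new_b'],
)
df['ratio_a'] = df['old_a'] / df['new_a']
df['ratio_b'] = df['old_b'] / df['new_b']

df['a_range'] = bucket_ratio(df['ratio_a'])
df['b_range'] = bucket_ratio(df['ratio_b'])
```

If infinite ratios should instead count as '>100', dropping the `where` line is the place to start, but check the output on your pandas version before relying on it.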