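I have a pandas DataFrame whose values I read in from a CSV file. It has a column labeled 'SleepQuality' holding floats from 0.0 to 100.0. I want to create a new column labeled 'SleepQualityGroup', where values of 0 - 49 in the original column map to 0 in the new column, 50 - 59 = 1, 60 - 69 = 2, 70 - 79 = 3, 80 - 89 = 4, and 90 - 100 = 5.
What is the best way to do this? I am stuck on the logic needed to identify all the values in each range and assign them the new value.
Example of the output in the new 'SleepQualityGroup' column: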
SleepQuality SleepQualityGroup
80.4 4
90.1 5
66.4 2
50.3 1
86.2 4
75.4 3
45.7 0
91.5 5
61.3 2
54 1
58.2 1
Answer 0 (score: 11)
Use pd.cut, i.e.:
df['new'] = pd.cut(df['SleepQuality'], bins=[0, 50, 60, 70, 80, 90, 100], labels=[0, 1, 2, 3, 4, 5])
Output:
    SleepQuality  SleepQualityGroup new
0           80.4                  4   4
1           90.1                  5   5
2           66.4                  2   2
3           50.3                  1   1
4           86.2                  4   4
5           75.4                  3   3
6           45.7                  0   0
7           91.5                  5   5
8           61.3                  2   2
9           54.0                  1   1
10          58.2                  1   1
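One detail worth checking is how values that sit exactly on a bin edge are treated. By default pd.cut uses right-closed intervals, so with the bins above 50.0 falls in (0, 50] and 0.0 falls outside every bin. The sketch below is one possible variant, not part of the original answer: it passes right=False so the bins read [0, 50), [50, 60), and so on, and it swaps the last edge for np.inf (an assumption made here so that exactly 100.0 still lands in group 5). The sample values are made up to hit the edges.

```python
import numpy as np
import pandas as pd

# Hypothetical edge-case values, chosen to sit on or next to the bin boundaries.
df = pd.DataFrame({"SleepQuality": [0.0, 49.9, 50.0, 59.9, 60.0, 79.9, 90.0, 100.0]})

# right=False makes every bin closed on the left: [0, 50), [50, 60), ..., [90, inf).
# The open-ended last edge is an assumption so that 100.0 is not left unbinned.
edges = [0, 50, 60, 70, 80, 90, np.inf]
df["SleepQualityGroup"] = pd.cut(
    df["SleepQuality"], bins=edges, labels=[0, 1, 2, 3, 4, 5], right=False
).astype(int)  # convert the categorical labels to plain integers

print(df)
```

If the data can never land exactly on a boundary, the simpler call above gives the same groups.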
Answer 1 (score: 6)
This is basically a binning operation, so either of these two tools could be used here.
Using np.searchsorted -
import numpy as np
bins = np.arange(50, 100, 10)  # bin edges: 50, 60, 70, 80, 90
df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)
Using np.digitize -
df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)
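The two calls agree on the sample data, but with their default arguments they differ for values that fall exactly on a bin edge: searchsorted defaults to side='left', while np.digitize defaults to right=False. A small sketch with made-up values to show the difference:

```python
import numpy as np

bins = np.arange(50, 100, 10)  # array([50, 60, 70, 80, 90])
values = np.array([49.9, 50.0, 50.1, 90.0, 100.0])  # hypothetical edge cases

left = bins.searchsorted(values)                 # side='left' is the default
right = bins.searchsorted(values, side="right")  # edge values go to the upper group
digitized = np.digitize(values, bins)            # right=False is the default

print(left)       # [0 0 1 4 5]
print(right)      # [0 1 1 5 5]
print(digitized)  # [0 1 1 5 5]
```

So if 50.0 should belong to group 1, as the 50 - 59 range in the question suggests, either np.digitize or searchsorted with side='right' would be the closer match.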
Sample output -
In [866]: df
Out[866]:
SleepQuality SleepQualityGroup
0 80.4 4
1 90.1 5
2 66.4 2
3 50.3 1
4 86.2 4
5 75.4 3
6 45.7 0
7 91.5 5
8 61.3 2
9 54.0 1
10 58.2 1
Runtime test -
In [921]: df
Out[921]:
SleepQuality SleepQualityGroup
0 80.4 4
1 90.1 5
2 66.4 2
3 50.3 1
4 86.2 4
5 75.4 3
6 45.7 0
7 91.5 5
8 61.3 2
9 54.0 1
10 58.2 1
In [922]: df = pd.concat([df]*10000,axis=0)
# @Dark's soln using pd.cut
In [923]: %timeit df['new'] = pd.cut(df['SleepQuality'],bins=[0,50 , 60, 70 , 80 , 90,100], labels=[0,1,2,3,4,5])
1000 loops, best of 3: 1.04 ms per loop
In [926]: %timeit df['SleepQualityGroup'] = bins.searchsorted(df.SleepQuality)
1000 loops, best of 3: 591 µs per loop
In [927]: %timeit df['SleepQualityGroup'] = np.digitize(df.SleepQuality, bins)
1000 loops, best of 3: 538 µs per loop
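For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. It rebuilds the same 11-row sample repeated 10000 times; the variable names and the ignore_index=True argument are choices made here, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

sample = [80.4, 90.1, 66.4, 50.3, 86.2, 75.4, 45.7, 91.5, 61.3, 54.0, 58.2]
# Repeat the sample 10000 times, as in the session above (index reset here for tidiness).
df = pd.concat([pd.DataFrame({"SleepQuality": sample})] * 10000, ignore_index=True)
bins = np.arange(50, 100, 10)

candidates = {
    "pd.cut": lambda: pd.cut(
        df["SleepQuality"], bins=[0, 50, 60, 70, 80, 90, 100], labels=[0, 1, 2, 3, 4, 5]
    ),
    "searchsorted": lambda: bins.searchsorted(df["SleepQuality"]),
    "np.digitize": lambda: np.digitize(df["SleepQuality"], bins),
}

# Check that all three approaches give the same groups on this data before timing them.
results = {name: np.asarray(fn()).astype(int) for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 3 repeats, 20 calls each, reported as seconds per call.
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{name}: {best * 1e6:.0f} us per call")
```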