Question

I want to generate "category intervals" from categories. For example, suppose I have the following:

>>> df['start'].describe()
count    259431.000000
mean         10.435858
std           5.504730
min           0.000000
25%           6.000000
50%          11.000000
75%          15.000000
max          20.000000
Name: start, dtype: float64

The unique values of my column are:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20], dtype=int8)

But I would like to use the following list of intervals:

>>> intervals
[[0, 2.2222222222222223],
 [2.2222222222222223, 4.4444444444444446],
 [4.4444444444444446, 6.666666666666667],
 [6.666666666666667, 8.8888888888888893],
 [8.8888888888888893, 11.111111111111111],
 [11.111111111111111, 13.333333333333332],
 [13.333333333333332, 15.555555555555554],
 [15.555555555555554, 17.777777777777775],
 [17.777777777777775, 20]]

to convert my column 'start' into values x, where x is the index of the interval that contains df['start'] (so in my case x would range from 0 to 8).

Is there a more or less simple way to do this with pandas / numpy?

Thanks a lot in advance for your help.

Regards.

Answer 1

You can use np.digitize:

import numpy as np
import pandas as pd

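# random integers in [0, 20]; randint's upper bound is exclusive, hence 21
# (np.random.random_integers is deprecated)
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))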

# the left-hand edges of each "interval"
intervals = np.linspace(0, 20, 9, endpoint=False)
print(intervals)
# [  0.           2.22222222   4.44444444   6.66666667   8.88888889
#   11.11111111  13.33333333  15.55555556  17.77777778]    

df['start_idx'] = np.digitize(df['start'], intervals) - 1

print(df.head())
#    start  start_idx
# 0      8          3
# 1     16          7
# 2      0          0
# 3      7          3
# 4      0          0
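
As a cross-check, here is a minimal sketch of the same binning done with pd.cut, which is not part of the answer above. It assumes the column holds the integers 0 to 20 from the question, and the names `edges`, `idx_cut` and `idx_digitize` are made up for illustration. Passing labels=False makes pd.cut return the integer bin index rather than an interval label, and include_lowest=True keeps the value 0 inside the first bin.

```python
import numpy as np
import pandas as pd

# Assumed data: one row per category value from the question (0 to 20).
df = pd.DataFrame({"start": np.arange(21)})

# All ten bin edges: the nine left-hand edges plus the final right edge, 20.
edges = np.linspace(0, 20, 10)

# pd.cut bins are closed on the right by default, i.e. (a, b];
# include_lowest=True also puts the lowest edge (0) into the first bin.
idx_cut = pd.cut(df["start"], bins=edges, labels=False, include_lowest=True)

# The np.digitize approach from the answer, using only the left-hand edges.
left_edges = np.linspace(0, 20, 9, endpoint=False)
idx_digitize = np.digitize(df["start"], left_edges) - 1

# Both approaches assign every value to the same interval index here,
# because no interior edge coincides with an integer category.
print((idx_cut.to_numpy() == idx_digitize).all())
```

The two methods differ only in which side of each bin is closed, so they could disagree for a value that sits exactly on an interior edge. With np.digitize and left-hand edges only, anything at or above the last edge (including 20) lands in the last bin, whereas pd.cut returns NaN for values outside the supplied edges.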
