How to add a new categorical column in pandas

Time: 2017-02-15 12:31:14

Tags: pandas optimization categorical-data

I have a DataFrame like this, with 10M rows:

                     probe
time                      
2016-01-01 00:05:00    3
2016-01-01 00:05:00    1
2016-01-01 00:05:00    5
2016-01-01 00:05:00    5
2016-01-01 00:05:00    4
2016-01-01 00:05:00    2
2016-01-01 00:05:00    5
2016-01-01 00:05:00    6
2016-01-01 00:05:00    3
2016-01-01 00:05:00    4
2016-01-01 00:05:00    5
2016-01-01 00:05:00    2
2016-01-01 00:05:00    3
2016-01-01 00:05:00    3
2016-01-01 00:05:00    5
Name: probe, dtype: uint8

I want to add a categorical column based on the value of probe:

def categorize_R(x):
    return "inner" if x['probe'] in (1, 4) else "outer"

data['category_R'] = pandas.Categorical(data.apply(categorize_R, axis=1))

This is very slow. By contrast, computing a mask like this:

mask_inner = (data['probe'] == 1) | (data['probe'] == 4)

is very fast, but then I don't know how to turn it into a categorical column.

1 Answer:

Answer 0: (score: 1)

I think you need numpy.where with a mask created by between:

import numpy as np
import pandas as pd
mask = data.probe.between(1, 4)
data['category_R'] = pd.Categorical(np.where(mask, 'inner', 'outer'))
print(data)
                     probe category_R
time                                 
2016-01-01 00:05:00      3      inner
2016-01-01 00:05:00      1      inner
2016-01-01 00:05:00      5      outer
2016-01-01 00:05:00      5      outer
2016-01-01 00:05:00      4      inner
2016-01-01 00:05:00      2      inner
2016-01-01 00:05:00      5      outer
2016-01-01 00:05:00      6      outer
2016-01-01 00:05:00      3      inner
2016-01-01 00:05:00      4      inner
2016-01-01 00:05:00      5      outer
2016-01-01 00:05:00      2      inner
2016-01-01 00:05:00      3      inner
2016-01-01 00:05:00      3      inner
2016-01-01 00:05:00      5      outer
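
Note that between(1, 4) is inclusive on both ends, so probes 2 and 3 are also labelled inner in the output above. If only the exact values from the question (1 and 4) should count as inner, Series.isin builds the same kind of vectorized mask. A minimal sketch, assuming a small made-up frame in place of the real 10M-row data:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data (values are illustrative only).
data = pd.DataFrame({'probe': np.array([3, 1, 5, 4, 2, 6], dtype='uint8')})

# Vectorized membership test: True only where probe is exactly 1 or 4.
mask = data['probe'].isin([1, 4])

# Map the boolean mask to labels and store them as a categorical column.
data['category_R'] = pd.Categorical(np.where(mask, 'inner', 'outer'))
print(data)
```

This avoids chaining several == comparisons with | when the list of matching values grows.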

Another solution is to use Categorical.from_codes; see also the "object creation" section of the pandas categorical docs, In [28]:

mask = (data['probe']==1) | (data['probe']==3) | (data['probe']==4)
data['category_R']  = pd.Categorical(np.where(mask, 'inner', 'outer'))
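# from_codes expects integer codes, so cast the boolean mask (0 -> 'outer', 1 -> 'inner')
data['category_R1']  = pd.Categorical.from_codes(mask.astype('int8'), ['outer','inner'])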
print (data)
                     probe category_R category_R1
time                                             
2016-01-01 00:05:00      3      inner       inner
2016-01-01 00:05:00      1      inner       inner
2016-01-01 00:05:00      5      outer       outer
2016-01-01 00:05:00      5      outer       outer
2016-01-01 00:05:00      4      inner       inner
2016-01-01 00:05:00      2      outer       outer
2016-01-01 00:05:00      5      outer       outer
2016-01-01 00:05:00      6      outer       outer
2016-01-01 00:05:00      3      inner       inner
2016-01-01 00:05:00      4      inner       inner
2016-01-01 00:05:00      5      outer       outer
2016-01-01 00:05:00      2      outer       outer
2016-01-01 00:05:00      3      inner       inner
2016-01-01 00:05:00      3      inner       inner
2016-01-01 00:05:00      5      outer       outer
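
To make the mapping explicit: from_codes treats each integer as a position in the categories list, so with ['outer', 'inner'] a code of 0 means outer and 1 means inner. The sketch below uses a hypothetical four-element mask; the cast to int8 is included because, as far as I can tell, recent pandas versions reject non-integer codes. It also shows that the two approaches can end up with a different category order, since pd.Categorical sorts the labels it infers.

```python
import numpy as np
import pandas as pd

# Hypothetical boolean mask standing in for the real one.
mask = np.array([True, False, True, False])

# Boolean -> integer codes: False becomes 0, True becomes 1.
codes = mask.astype('int8')

# Code i selects categories[i], so 0 -> 'outer' and 1 -> 'inner'.
cat_codes = pd.Categorical.from_codes(codes, categories=['outer', 'inner'])

# Same labels built via np.where; categories are inferred and sorted.
cat_where = pd.Categorical(np.where(mask, 'inner', 'outer'))

print(cat_codes.tolist(), list(cat_codes.categories))
print(cat_where.tolist(), list(cat_where.categories))
```

The labels are the same either way; only the internal category order (and therefore the stored codes) differs, which matters mainly if the column is later sorted or compared by category order.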

Timings:

In [181]: %timeit pd.Categorical(np.where(mask, 'inner', 'outer'))
1000 loops, best of 3: 196 µs per loop

In [182]: %timeit pd.Categorical.from_codes(mask, ['outer','inner'])
10000 loops, best of 3: 139 µs per loop
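
Timings in the microsecond range suggest they were taken on a small frame. To see how the two approaches compare at a size closer to the question's 10M rows, a sketch like the following could be used. The row count, the random data, and the helper names via_where and via_codes are assumptions for illustration; results depend on the machine and pandas version, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 1_000_000  # assumed size; raise toward 10M if memory allows
rng = np.random.default_rng(0)

# Random probe values 1..6, stored as uint8 like in the question.
data = pd.DataFrame({'probe': rng.integers(1, 7, size=n_rows).astype('uint8')})
mask = data['probe'].isin([1, 4])


def via_where():
    # Build string labels first, then let pandas infer the categories.
    return pd.Categorical(np.where(mask, 'inner', 'outer'))


def via_codes():
    # Skip the string array: pass integer codes and the categories directly.
    return pd.Categorical.from_codes(mask.to_numpy().astype('int8'), ['outer', 'inner'])


for fn in (via_where, via_codes):
    # Best of three single runs, in seconds.
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s")
```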