I have a DataFrame like this, with 10M rows:
probe
time
2016-01-01 00:05:00 3
2016-01-01 00:05:00 1
2016-01-01 00:05:00 5
2016-01-01 00:05:00 5
2016-01-01 00:05:00 4
2016-01-01 00:05:00 2
2016-01-01 00:05:00 5
2016-01-01 00:05:00 6
2016-01-01 00:05:00 3
2016-01-01 00:05:00 4
2016-01-01 00:05:00 5
2016-01-01 00:05:00 2
2016-01-01 00:05:00 3
2016-01-01 00:05:00 3
2016-01-01 00:05:00 5
Name: probe, dtype: uint8
I want to add a categorical column based on the value of probe:
def categorize_R(x):
    return "inner" if x['probe'] in (1, 4) else "outer"
data['category_R'] = pandas.Categorical(data.apply(categorize_R, axis=1))
This is very slow. By contrast, computing a mask like this:
mask_inner = (x['probe'] == 1) | (x['probe'] == 4)
is very fast, but then I don't know how to turn it into the categorical column.
Answer 0 (score: 1)
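I think you need numpy.where with a mask created by between. Note that between(1, 4) is inclusive at both ends, so it also marks 2 and 3 as inner: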
mask = data.probe.between(1,4)
data['category_R'] = pd.Categorical(np.where(mask, 'inner', 'outer'))
print (data)
probe category_R
time
2016-01-01 00:05:00 3 inner
2016-01-01 00:05:00 1 inner
2016-01-01 00:05:00 5 outer
2016-01-01 00:05:00 5 outer
2016-01-01 00:05:00 4 inner
2016-01-01 00:05:00 2 inner
2016-01-01 00:05:00 5 outer
2016-01-01 00:05:00 6 outer
2016-01-01 00:05:00 3 inner
2016-01-01 00:05:00 4 inner
2016-01-01 00:05:00 5 outer
2016-01-01 00:05:00 2 inner
2016-01-01 00:05:00 3 inner
2016-01-01 00:05:00 3 inner
2016-01-01 00:05:00 5 outer
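If only the exact values 1 and 4 should count as inner, as in the question's categorize_R, a membership test with Series.isin is one way to build the mask. The following is a minimal sketch; the small sample frame is made up for illustration and stands in for the real 10M-row data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real 10M-row frame
data = pd.DataFrame({"probe": np.array([3, 1, 5, 5, 4, 2, 5, 6], dtype="uint8")})

# isin matches exactly the listed values (1 and 4), unlike between(1, 4)
mask = data["probe"].isin([1, 4])

# Vectorized selection of the label, then conversion to a categorical column
data["category_R"] = pd.Categorical(np.where(mask, "inner", "outer"))
print(data)
```

Because the mask is computed on the whole column at once, this avoids the per-row Python call that makes apply(..., axis=1) slow.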
Another solution is to use Categorical.from_codes (see also object creation - In [28] in the pandas docs):
mask = (data['probe']==1) | (data['probe']==3) | (data['probe']==4)
data['category_R'] = pd.Categorical(np.where(mask, 'inner', 'outer'))
data['category_R1'] = pd.Categorical.from_codes(mask, ['outer','inner'])
print (data)
probe category_R category_R1
time
2016-01-01 00:05:00 3 inner inner
2016-01-01 00:05:00 1 inner inner
2016-01-01 00:05:00 5 outer outer
2016-01-01 00:05:00 5 outer outer
2016-01-01 00:05:00 4 inner inner
2016-01-01 00:05:00 2 outer outer
2016-01-01 00:05:00 5 outer outer
2016-01-01 00:05:00 6 outer outer
2016-01-01 00:05:00 3 inner inner
2016-01-01 00:05:00 4 inner inner
2016-01-01 00:05:00 5 outer outer
2016-01-01 00:05:00 2 outer outer
2016-01-01 00:05:00 3 inner inner
2016-01-01 00:05:00 3 inner inner
2016-01-01 00:05:00 5 outer outer
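One caveat, stated as an assumption about newer pandas releases rather than about the version used above: from_codes expects integer codes, and a boolean mask may be rejected with a ValueError. A minimal sketch that converts the mask to integers first (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
data = pd.DataFrame({"probe": np.array([3, 1, 5, 2, 4], dtype="uint8")})
mask = data["probe"].isin([1, 3, 4])

# Codes index into the categories list: 0 -> 'outer', 1 -> 'inner'
codes = mask.to_numpy().astype("int8")
data["category_R1"] = pd.Categorical.from_codes(codes, categories=["outer", "inner"])
print(data)
```

The order of the categories list matters here: False becomes code 0 and True becomes code 1, so 'outer' has to come first.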
Timings:
In [181]: %timeit pd.Categorical(np.where(mask, 'inner', 'outer'))
1000 loops, best of 3: 196 µs per loop
In [182]: %timeit pd.Categorical.from_codes(mask, ['outer','inner'])
10000 loops, best of 3: 139 µs per loop
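To repeat the comparison outside IPython, the standard-library timeit module can be used. This is only a sketch on synthetic data: the array size, the random seed, and the helper names via_where and via_codes are arbitrary choices for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic probe values in 1..6 (size and seed are arbitrary)
rng = np.random.default_rng(0)
probe = pd.Series(rng.integers(1, 7, size=100_000).astype("uint8"))
mask = probe.isin([1, 3, 4])
codes = mask.to_numpy().astype("int8")  # integer codes for from_codes


def via_where():
    # Build string labels first, then let pandas infer the categories
    return pd.Categorical(np.where(mask, "inner", "outer"))


def via_codes():
    # Skip the string array entirely: codes map straight to categories
    return pd.Categorical.from_codes(codes, categories=["outer", "inner"])


runs = 20
t_where = timeit.timeit(via_where, number=runs)
t_codes = timeit.timeit(via_codes, number=runs)
print(f"np.where:   {t_where / runs * 1e3:.3f} ms per call")
print(f"from_codes: {t_codes / runs * 1e3:.3f} ms per call")
```

Both functions produce the same labels; from_codes avoids creating and factorizing an intermediate array of strings, which is the likely reason it comes out ahead in the timings above.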