Rolling majority on non-numeric data

Time: 2015-07-03 09:08:19

Tags: pandas categorical-data

Given a DataFrame:

df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})

I want to replace each value in column 'a' with the majority of the values around it. For numeric data, I can do this:

def majority(window):
    freqs = scipy.stats.itemfreq(window)
    max_votes = freqs[:,1].argmax()
    return freqs[max_votes,0]

df['a'] = pd.rolling_apply(df['a'], 3, majority)

which gives me:

In [43]: df
Out[43]: 
     a
0  NaN
1  NaN
2    1
3    1
4    1
5    1
6    1
7    2
8    2
9    2
10   2

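I will have to deal with the NaNs, but apart from that this is more or less what I want. However, I would also like to do the same thing with non-numeric columns, and Pandas does not seem to support this: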

In [47]: df['b'] = list('aaaababbbba')
In [49]: df['b'] = pd.rolling_apply(df['b'], 3, majority)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-507f45aab92c> in <module>()
----> 1 df['b'] = pd.rolling_apply(df['b'], 3, majority)

/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in rolling_apply(arg, window, func, min_periods, freq, center, args, kwargs)
    751         return algos.roll_generic(arg, window, minp, offset, func, args, kwargs)
    752     return _rolling_moment(arg, window, call_cython, min_periods, freq=freq,
--> 753                            center=False, args=args, kwargs=kwargs)
    754 
    755 

/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _rolling_moment(arg, window, func, minp, axis, freq, center, how, args, kwargs, **kwds)
    382     arg = _conv_timerule(arg, freq, how)
    383 
--> 384     return_hook, values = _process_data_structure(arg)
    385 
    386     if values.size == 0:

/usr/local/lib/python2.7/dist-packages/pandas/stats/moments.pyc in _process_data_structure(arg, kill_inf)
    433 
    434     if not issubclass(values.dtype.type, float):
--> 435         values = values.astype(float)
    436 
    437     if kill_inf:

ValueError: could not convert string to float: a

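I tried converting the column to a Categorical, but even then I get the same error. I could first convert to a Categorical, work on the codes, and finally convert the codes back to labels, but that seems really convoluted.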

Is there an easier, more natural solution?

(BTW: I am limited to NumPy 1.8.2, so I have to use itemfreq instead of unique; see here.)

2 answers:

Answer 0 (score: 6)

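Here is one way, using pd.Categorical: run the rolling majority on the integer category codes, then map the codes back to their labels.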

import scipy.stats as stats
import pandas as pd

def majority(window):
    freqs = stats.itemfreq(window)
    max_votes = freqs[:,1].argmax()
    return freqs[max_votes,0]

df = pd.DataFrame({'a' : [1,1,1,1,1,2,1,2,2,2,2]})
df['a'] = pd.rolling_apply(df['a'], 3, majority)
df['b'] = list('aaaababbbba')

cat = pd.Categorical(df['b'])
df['b'] = pd.rolling_apply(cat.codes, 3, majority)
df['b'] = df['b'].map(pd.Series(cat.categories))
print(df)

yields

     a    b
0  NaN  NaN
1  NaN  NaN
2    1    a
3    1    a
4    1    a
5    1    a
6    1    b
7    2    b
8    2    b
9    2    b
10   2    b
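
For readers on a more recent stack, the same codes-then-labels round trip can be sketched as follows. This is only an illustration and not part of the original answer: it assumes a pandas version that provides Series.rolling and a NumPy version (1.9 or later) where np.unique accepts return_counts, so it does not apply under the asker's NumPy 1.8.2 constraint. The column name b_majority is made up for the example.

```python
import numpy as np
import pandas as pd


def majority(window):
    """Return the most frequent value in a 1-D array (ties go to the smallest value)."""
    values, counts = np.unique(window, return_counts=True)
    return values[counts.argmax()]


df = pd.DataFrame({"b": list("aaaababbbba")})

# Encode the labels as integer codes, since rolling windows only accept numeric data.
codes, labels = pd.factorize(df["b"], sort=True)

# raw=True passes each window to `majority` as a NumPy array.
voted = pd.Series(codes, index=df.index).rolling(3).apply(majority, raw=True)

# Map the winning codes back to labels; incomplete windows stay NaN.
df["b_majority"] = voted.map(lambda code: labels[int(code)], na_action="ignore")
print(df)
```

The first two rows stay NaN because their windows are incomplete, which matches the behaviour of the output above.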

Answer 1 (score: 1)

Here is one way to do it, by defining your own rolling apply function.

{{1}}
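
The code for this answer is not reproduced here. Purely as an illustration of the idea, and not the answerer's actual function, a hand-rolled rolling majority might look like the sketch below. The helper name rolling_majority and the column name b_majority are hypothetical, and the window is assumed to be trailing, as in the question.

```python
from collections import Counter

import pandas as pd


def rolling_majority(series, window):
    """Hypothetical helper: most common value in each trailing window."""
    values = series.tolist()
    # Positions without a full window are left as None.
    result = [None] * len(values)
    for end in range(window, len(values) + 1):
        # Counter.most_common(1) returns [(value, count)] for the top value.
        result[end - 1] = Counter(values[end - window:end]).most_common(1)[0][0]
    return pd.Series(result, index=series.index, dtype=object)


df = pd.DataFrame({"b": list("aaaababbbba")})
df["b_majority"] = rolling_majority(df["b"], 3)
print(df)
```

Because the loop works on plain Python objects, it avoids the float conversion that raised the ValueError in the question, at the cost of being slower than a vectorised rolling window on large frames.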