How to generate .cat.codes using an IntervalIndex

Time: 2018-03-13 20:56:31

Tags: python pandas dataframe categories categorical-data

I have a dataset that I bin with pd.qcut, and the resulting categories are represented as a pandas.core.indexes.interval.IntervalIndex.

How can I use the same categories that I generated from the original dataset to categorize a future dataset?

Explanatory code:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))
>>> df
    A   B   C   D
0  51  92  14  71
1  60  20  82  86
2  74  74  87  99
3  23   2  21  52
4   1  87  29  37
5   1  63  59  20
6  32  75  57  21
7  88  48  90  58
8  41  91  59  79
9  14  61  61  46
>>> categories_a = pd.qcut(df['A'], 4).cat.categories
>>> type(categories_a)
pandas.core.indexes.interval.IntervalIndex
>>> categories_a
IntervalIndex([(0.999, 16.25], (16.25, 36.5], (36.5, 57.75], (57.75, 88.0]]
              closed='right',
              dtype='interval[float64]')

What I have tried:

I tried things like df['B'].astype(categories_a) and CategoricalDtype, but without success. I am running out of ideas for how to do this elegantly.

Expected result:

Given that df is the same as above, I want to convert all elements of df['B'], according to categories_a, into the codes that .cat.codes would produce, as in pd.qcut(df['A'], 4).cat.codes. The output would look like:

df['B'] 
original --> processed # comment
92 --> 3 # (57.75, 88.0] this one actually goes through the roof
20 --> 1 # (16.25, 36.5]
74 --> 3 # (57.75, 88.0]
2  --> 0 # (0.999, 16.25]
87 --> 3 # (57.75, 88.0]
63 --> 3 # (57.75, 88.0]
75 --> 3 # (57.75, 88.0]
48 --> 2 # (36.5, 57.75]
91 --> 3 # (57.75, 88.0]
61 --> 3 # (57.75, 88.0]

I hope this is clear.

2 answers:

Answer 0: (score: 2)

You can splice the left and right endpoints of the intervals together to build the bins to use in pd.cut.

def cut_by_cats(cats):
    # lowest left edge, followed by the right edge of every interval
    bins = [cats[0].left] + [i.right for i in cats]
    def cut_(series):
        return pd.cut(series, bins)
    return cut_

cut = cut_by_cats(pd.qcut(df.A, 4).cat.categories)

df.apply(cut)

                A               B               C              D
0   (36.5, 57.75]             NaN  (0.999, 16.25]  (57.75, 88.0]
1   (57.75, 88.0]   (16.25, 36.5]   (57.75, 88.0]  (57.75, 88.0]
2   (57.75, 88.0]   (57.75, 88.0]   (57.75, 88.0]            NaN
3   (16.25, 36.5]  (0.999, 16.25]   (16.25, 36.5]  (36.5, 57.75]
4  (0.999, 16.25]   (57.75, 88.0]   (16.25, 36.5]  (36.5, 57.75]
5  (0.999, 16.25]   (57.75, 88.0]   (57.75, 88.0]  (16.25, 36.5]
6   (16.25, 36.5]   (57.75, 88.0]   (36.5, 57.75]  (16.25, 36.5]
7   (57.75, 88.0]   (36.5, 57.75]             NaN  (57.75, 88.0]
8   (36.5, 57.75]             NaN   (57.75, 88.0]  (57.75, 88.0]
9  (0.999, 16.25]   (57.75, 88.0]   (57.75, 88.0]  (36.5, 57.75]
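
As a possible shortcut, pd.cut also accepts an IntervalIndex directly as its bins argument, so the categories from the question could be reused without rebuilding the edges. The sketch below is a minimal illustration of that idea, not part of the original answer; the column values are copied from the question's printout rather than regenerated, and the names b_binned and b_codes are made up for the example. Values outside every interval become NaN, which .cat.codes reports as -1.

```python
import pandas as pd

# Columns A and B copied from the question's df printout.
df = pd.DataFrame({
    'A': [51, 60, 74, 23, 1, 1, 32, 88, 41, 14],
    'B': [92, 20, 74, 2, 87, 63, 75, 48, 91, 61],
})

# The IntervalIndex learned from column A.
categories_a = pd.qcut(df['A'], 4).cat.categories

# Reuse those intervals as the bins for column B.
b_binned = pd.cut(df['B'], bins=categories_a)

# Integer codes; values that fall outside all intervals get -1.
b_codes = b_binned.cat.codes
print(b_codes.tolist())
```

Here 92 and 91 lie above the last interval, so they come out as -1 rather than 3; the second answer shows how to push such values into the nearest category.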

Answer 1: (score: 2)

Using the same logic as above, but to get the category codes rather than the ranges in the processed dataframe:

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))


series, bins = pd.qcut(df["A"], 4, retbins=True, labels=False)
def apply_cut(df):
    for i in df.columns:
        df[i] = pd.cut(df[i], bins=bins, labels=False, include_lowest=True)
    return df

processed = apply_cut(df)

Returns:

>>> processed
   A    B    C    D
0  2  NaN  0.0  3.0
1  3  1.0  3.0  3.0
2  3  3.0  3.0  NaN
3  1  0.0  1.0  2.0
4  0  3.0  1.0  2.0
5  0  3.0  3.0  1.0
6  1  3.0  2.0  1.0
7  3  2.0  NaN  3.0
8  2  NaN  3.0  3.0
9  0  3.0  3.0  2.0

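If you want to get rid of the NaNs and force those values into the nearest category, you can do the same thing, but set the first and last bin edges to -float(np.inf) and float(np.inf):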

# apply_cut above modifies df in place, so start again from the original df here
series, bins = pd.qcut(df["A"], 4, retbins=True, labels=False)
bins[0] = -float(np.inf)
bins[-1] = float(np.inf)
def apply_cut(df):
    for i in df.columns:
        df[i] = pd.cut(df[i], bins=bins, labels=False, include_lowest=True,right=False)
    return df

processed = apply_cut(df)

>>> processed
   A  B  C  D
0  2  3  0  3
1  3  1  3  3
2  3  3  3  3
3  1  0  1  2
4  0  3  1  2
5  0  3  3  1
6  1  3  2  1
7  3  2  3  3
8  2  3  3  3
9  0  3  3  2
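
One detail that may matter for other data: pd.qcut produces right-closed intervals, while right=False makes the bins left-closed, so a value sitting exactly on an edge would land one bin higher. The integer data here never hits an edge, so the result is unaffected. The following is a small sketch of the difference, with the edges hard-coded from the question's output and edge_values chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Quartile edges of column A from the question, with open-ended outer bins.
bins = np.array([-np.inf, 16.25, 36.5, 57.75, np.inf])

# Hypothetical values that sit exactly on the inner edges.
edge_values = pd.Series([16.25, 36.5, 57.75])

# Default right=True: (a, b] intervals, matching what pd.qcut produced.
right_closed = pd.cut(edge_values, bins=bins, labels=False)

# right=False: [a, b) intervals, so each edge value moves up one bin.
left_closed = pd.cut(edge_values, bins=bins, labels=False, right=False)

print(right_closed.tolist())  # [0, 1, 2]
print(left_closed.tolist())   # [1, 2, 3]
```

If the codes need to agree with pd.qcut on edge values, leaving right at its default is probably the safer choice.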

Edit by piR

Please delete/change as you see fit. This is your post and I am intruding (-:

bins = pd.qcut(df.A, 4, retbins=True)[1]
# nudge the lowest edge down slightly so the minimum value falls inside the first bin
bins[0] -= np.finfo(float).resolution

df.apply(lambda c: pd.cut(c, bins))

               A              B              C              D
0  (36.5, 57.75]            NaN   (1.0, 16.25]  (57.75, 88.0]
1  (57.75, 88.0]  (16.25, 36.5]  (57.75, 88.0]  (57.75, 88.0]
2  (57.75, 88.0]  (57.75, 88.0]  (57.75, 88.0]            NaN
3  (16.25, 36.5]   (1.0, 16.25]  (16.25, 36.5]  (36.5, 57.75]
4   (1.0, 16.25]  (57.75, 88.0]  (16.25, 36.5]  (36.5, 57.75]
5   (1.0, 16.25]  (57.75, 88.0]  (57.75, 88.0]  (16.25, 36.5]
6  (16.25, 36.5]  (57.75, 88.0]  (36.5, 57.75]  (16.25, 36.5]
7  (57.75, 88.0]  (36.5, 57.75]            NaN  (57.75, 88.0]
8  (36.5, 57.75]            NaN  (57.75, 88.0]  (57.75, 88.0]
9   (1.0, 16.25]  (57.75, 88.0]  (57.75, 88.0]  (36.5, 57.75]

Or:

bins = pd.qcut(df.A, 4, retbins=True)[1]
bins[0] = -float(np.inf)
bins[-1] = float(np.inf)

processed = df.apply(lambda c: pd.cut(c, bins, labels=False))

>>> processed
   A  B  C  D
0  2  3  0  3
1  3  1  3  3
2  3  3  3  3
3  1  0  1  2
4  0  3  1  2
5  0  3  3  1
6  1  3  2  1
7  3  2  3  3
8  2  3  3  3
9  0  3  3  2
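
For completeness, the same open-ended binning can be sketched without pd.cut at all, by looking each value up against the inner interval edges with np.searchsorted. This is an alternative technique, not something either answer proposes; the column values are copied from the question's printout, and inner_edges, a_codes and b_codes are names invented for the example. With side='left', a value v gets index i where edges[i-1] < v <= edges[i], which mirrors the right-closed intervals from pd.qcut.

```python
import numpy as np
import pandas as pd

# Columns A and B copied from the question's df printout.
a = pd.Series([51, 60, 74, 23, 1, 1, 32, 88, 41, 14], name='A')
b = pd.Series([92, 20, 74, 2, 87, 63, 75, 48, 91, 61], name='B')

categories_a = pd.qcut(a, 4).cat.categories

# Drop the outermost edges so anything below or above the fitted range
# falls into the first or last category: [16.25, 36.5, 57.75].
inner_edges = categories_a.right[:-1].to_numpy()

# side='left' keeps the intervals right-closed, like pd.qcut.
a_codes = np.searchsorted(inner_edges, a.to_numpy(), side='left')
b_codes = np.searchsorted(inner_edges, b.to_numpy(), side='left')

print(b_codes.tolist())  # [3, 1, 3, 0, 3, 3, 3, 2, 3, 3]
```

The codes for B match the expected output listed in the question, including 92 being assigned to the last category.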