I have a dataset that I bin with qcut, representing the categories as .cat.codes.
How can I use the same categories I generated from the original dataset to categorize a future dataset?
Explanatory code:
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> df
    A   B   C   D
0  51  92  14  71
1  60  20  82  86
2  74  74  87  99
3  23   2  21  52
4   1  87  29  37
5   1  63  59  20
6  32  75  57  21
7  88  48  90  58
8  41  91  59  79
9  14  61  61  46
>>> categories_a = pd.qcut(df['A'], 4).cat.categories
>>> type(categories_a)
pandas.core.indexes.interval.IntervalIndex
>>> categories_a
IntervalIndex([(0.999, 16.25], (16.25, 36.5], (36.5, 57.75], (57.75, 88.0]]
              closed='right',
              dtype='interval[float64]')
What I tried:
I tried df['B'].astype(categories_a), CategoricalDtype, and the like, but with no success. I am running out of ideas on how to do this in an elegant way.
Expected result:
Given categories_a and df as above, I would like to convert every element of df['B'], according to categories_a, into the code that pd.qcut(df['A'], 4).cat.codes would assign to it. The output would look like:
df['B']
original --> processed # comment
92 --> 3 # (57.75, 88.0] this one actually goes through the roof
20 --> 1 # (16.25, 36.5]
74 --> 3 # (57.75, 88.0]
2 --> 0 # (0.999, 16.25]
87 --> 3 # (57.75, 88.0]
63 --> 3 # (57.75, 88.0]
75 --> 3 # (57.75, 88.0]
48 --> 2 # (36.5, 57.75]
91 --> 3 # (57.75, 88.0]
61 --> 3 # (57.75, 88.0]
I hope this is clear.
Answer 0 (score: 2)
You can stitch together the left and right endpoints of the intervals to build bins for use in pd.cut.
def cut_by_cats(cats):
    # Bin edges: the left edge of the first interval, then every right edge.
    bins = [cats[0].left] + [i.right for i in cats]
    def cut_(series):
        return pd.cut(series, bins)
    return cut_

cut = cut_by_cats(pd.qcut(df.A, 4).cat.categories)
df.apply(cut)
A B C D
0 (36.5, 57.75] NaN (0.999, 16.25] (57.75, 88.0]
1 (57.75, 88.0] (16.25, 36.5] (57.75, 88.0] (57.75, 88.0]
2 (57.75, 88.0] (57.75, 88.0] (57.75, 88.0] NaN
3 (16.25, 36.5] (0.999, 16.25] (16.25, 36.5] (36.5, 57.75]
4 (0.999, 16.25] (57.75, 88.0] (16.25, 36.5] (36.5, 57.75]
5 (0.999, 16.25] (57.75, 88.0] (57.75, 88.0] (16.25, 36.5]
6 (16.25, 36.5] (57.75, 88.0] (36.5, 57.75] (16.25, 36.5]
7 (57.75, 88.0] (36.5, 57.75] NaN (57.75, 88.0]
8 (36.5, 57.75] NaN (57.75, 88.0] (57.75, 88.0]
9 (0.999, 16.25] (57.75, 88.0] (57.75, 88.0] (36.5, 57.75]
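As a possible shortcut for the approach above, pd.cut also accepts an IntervalIndex directly as bins, so the categories from qcut can be passed in without rebuilding the edge list. The sketch below hardcodes columns A and B from the question instead of relying on the random seed; the names b_binned and b_codes are only illustrative.

```python
import pandas as pd

# Columns A and B copied from the question's example frame.
df = pd.DataFrame({
    "A": [51, 60, 74, 23, 1, 1, 32, 88, 41, 14],
    "B": [92, 20, 74, 2, 87, 63, 75, 48, 91, 61],
})

# Categories (an IntervalIndex) learned from the original column.
categories_a = pd.qcut(df["A"], 4).cat.categories

# Reuse exactly the same intervals on another column or a future dataset.
b_binned = pd.cut(df["B"], bins=categories_a)

# Integer codes; a value that falls in no interval is NaN, which has code -1.
b_codes = b_binned.cat.codes
print(b_codes.tolist())
```

Values of B above 88 (here 92 and 91) lie outside every interval, so they come back as NaN with code -1 rather than being pushed into the top category.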
Answer 1 (score: 2)
Using the same logic as above, but to get category codes rather than ranges in the processed dataframe:
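import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
# retbins=True also returns the bin edges computed from column A.
series, bins = pd.qcut(df["A"], 4, retbins=True, labels=False)

def apply_cut(df):
    # Overwrites each column in place with its bin code.
    for i in df.columns:
        df[i] = pd.cut(df[i], bins=bins, labels=False, include_lowest=True)
    return df

processed = apply_cut(df)
Returns: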
>>> processed
A B C D
0 2 NaN 0.0 3.0
1 3 1.0 3.0 3.0
2 3 3.0 3.0 NaN
3 1 0.0 1.0 2.0
4 0 3.0 1.0 2.0
5 0 3.0 3.0 1.0
6 1 3.0 2.0 1.0
7 3 2.0 NaN 3.0
8 2 NaN 3.0 3.0
9 0 3.0 3.0 2.0
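If you want to get rid of the NaNs and force those values into the nearest category, you can do the same, but replace the first and last bin edges with -np.inf and np.inf:
# Re-create df from the original data first: apply_cut above overwrote it in place.
series, bins = pd.qcut(df["A"], 4, retbins=True, labels=False)
bins[0] = -np.inf
bins[-1] = np.inf

def apply_cut(df):
    for i in df.columns:
        # The default right-closed bins match the intervals qcut produced.
        df[i] = pd.cut(df[i], bins=bins, labels=False, include_lowest=True)
    return df

processed = apply_cut(df)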
>>> processed
A B C D
0 2 3 0 3
1 3 1 3 3
2 3 3 3 3
3 1 0 1 2
4 0 3 1 2
5 0 3 3 1
6 1 3 2 1
7 3 2 3 3
8 2 3 3 3
9 0 3 3 2
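Another way to get the same effect, sketched here as an alternative and not as part of the answer above, is to leave the bin edges untouched and clip the new values into the range seen in A before cutting. Columns A and B are hardcoded from the question; clipped and b_codes are illustrative names.

```python
import pandas as pd

# Columns A and B copied from the question's example frame.
df = pd.DataFrame({
    "A": [51, 60, 74, 23, 1, 1, 32, 88, 41, 14],
    "B": [92, 20, 74, 2, 87, 63, 75, 48, 91, 61],
})

# Bin edges learned from column A (min, the three quartiles, max).
_, bins = pd.qcut(df["A"], 4, retbins=True, labels=False)

# Pull out-of-range values back to the nearest edge, then cut as before.
clipped = df["B"].clip(lower=bins[0], upper=bins[-1])
b_codes = pd.cut(clipped, bins=bins, labels=False, include_lowest=True)
print(b_codes.tolist())
```

This keeps the original finite edges, which may be preferable if the bins are stored or displayed later; the trade-off is an extra pass over the data.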
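Please delete/change this as needed. It's your post; I'm just barging in (-:
bins = pd.qcut(df.A, 4, retbins=True)[1]
# Nudge the lowest edge down so the minimum of A falls inside the first bin.
bins[0] -= np.finfo(float).resolution
df.apply(lambda c: pd.cut(c, bins))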
A B C D
0 (36.5, 57.75] NaN (1.0, 16.25] (57.75, 88.0]
1 (57.75, 88.0] (16.25, 36.5] (57.75, 88.0] (57.75, 88.0]
2 (57.75, 88.0] (57.75, 88.0] (57.75, 88.0] NaN
3 (16.25, 36.5] (1.0, 16.25] (16.25, 36.5] (36.5, 57.75]
4 (1.0, 16.25] (57.75, 88.0] (16.25, 36.5] (36.5, 57.75]
5 (1.0, 16.25] (57.75, 88.0] (57.75, 88.0] (16.25, 36.5]
6 (16.25, 36.5] (57.75, 88.0] (36.5, 57.75] (16.25, 36.5]
7 (57.75, 88.0] (36.5, 57.75] NaN (57.75, 88.0]
8 (36.5, 57.75] NaN (57.75, 88.0] (57.75, 88.0]
9 (1.0, 16.25] (57.75, 88.0] (57.75, 88.0] (36.5, 57.75]
Or:
bins = pd.qcut(df.A, 4, retbins=True)[1]
bins[0] = -float(np.inf)
bins[-1] = float(np.inf)
processed = df.apply(lambda c: pd.cut(c, bins, labels=False))
>>> processed
A B C D
0 2 3 0 3
1 3 1 3 3
2 3 3 3 3
3 1 0 1 2
4 0 3 1 2
5 0 3 3 1
6 1 3 2 1
7 3 2 3 3
8 2 3 3 3
9 0 3 3 2
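If only the integer codes are needed, the open-ended binning above can also be approximated with plain NumPy. This is a sketch under the assumption that np.digitize with right=True on the interior edges is an acceptable stand-in for pd.cut with infinite outer edges; inner_edges and b_codes are illustrative names, and missing values are not handled specially here.

```python
import numpy as np
import pandas as pd

# Columns A and B copied from the question's example frame.
df = pd.DataFrame({
    "A": [51, 60, 74, 23, 1, 1, 32, 88, 41, 14],
    "B": [92, 20, 74, 2, 87, 63, 75, 48, 91, 61],
})

# Bin edges learned from column A.
_, bins = pd.qcut(df["A"], 4, retbins=True)

# Drop the outer edges: anything below the first interior edge gets code 0,
# anything above the last interior edge gets the highest code.
inner_edges = bins[1:-1]

# right=True means edges[i-1] < x <= edges[i], i.e. right-closed like qcut.
b_codes = np.digitize(df["B"].to_numpy(), inner_edges, right=True)
print(b_codes.tolist())
```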