Question

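I am trying to convert categorical variables to integers. However, I want them to use the same key (A should be converted to 1 in all fields). My code below does not use the same key.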

import pandas as pd

df1 = pd.DataFrame({'A' : ['A', 'A', 'C', 'D','B']})

df2 = pd.DataFrame({'A' : ['D', 'D', 'B', 'A','A']})

df1_int = pd.factorize(df1['A'])[0]
print(df1_int)

df2_int = pd.factorize(df2['A'])[0]
print(df2_int)

This is the output I get:

    [0 0 1 2 3]
    [0 0 1 2 2]

Answer 1

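You can convert the existing columns to a categorical dtype. When you use the same categories for both, the underlying integer values (which you can access as codes through Series.cat.codes) will be consistent between the two DataFrames: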

In [5]: df1['A'].astype('category', categories=list('ABCD')).cat.codes
Out[5]:
0    0
1    0
2    2
3    3
4    1
dtype: int8

In [6]: df2['A'].astype('category', categories=list('ABCD')).cat.codes
Out[6]:
0    3
1    3
2    1
3    0
4    0
dtype: int8

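If you don't want to specify the categories manually, you can also reuse the categories of the first DataFrame to make sure they are the same: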

df1['A'] = df1['A'].astype('category')
df2['A'] = df2['A'].astype('category', categories=df1['A'].cat.categories)

Note: astype('category', categories=...) only works for pandas >= 0.16. With pandas 0.15, you can first convert to the category dtype and then set the categories with set_categories (see the docs).
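As far as I know, more recent pandas releases no longer accept the categories= keyword in astype, so the calls above may raise a TypeError there. A minimal sketch of the same idea using CategoricalDtype follows; the variable names shared, df1_codes, df2_codes and extra are mine, not from the answer above:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df1 = pd.DataFrame({'A': ['A', 'A', 'C', 'D', 'B']})
df2 = pd.DataFrame({'A': ['D', 'D', 'B', 'A', 'A']})

# One shared dtype fixes the value -> code mapping for every column that uses it.
shared = CategoricalDtype(categories=list('ABCD'))

df1_codes = df1['A'].astype(shared).cat.codes
df2_codes = df2['A'].astype(shared).cat.codes

print(df1_codes.tolist())  # [0, 0, 2, 3, 1]
print(df2_codes.tolist())  # [3, 3, 1, 0, 0]

# Values that are not in the shared categories become missing and get code -1.
extra = pd.Series(['A', 'E']).astype(shared).cat.codes
print(extra.tolist())  # [0, -1]
```

The -1 code is worth checking for if the second DataFrame might contain values the first one never had.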

Answer 2

When you want to learn the categories from one DataFrame and apply them to other DataFrames, scikit-learn may offer a more elegant solution:

from sklearn import preprocessing
import pandas as pd

df1 = pd.DataFrame({'A' : ['A', 'A', 'C', 'D','B'],
                    'B' : ['one', 'one', 'two', 'three','four']})

df2 = pd.DataFrame({'A' : ['D', 'D', 'B', 'A','A'],
                    'B' : ['one', 'five', 'two', 'three','four']})

le = preprocessing.LabelEncoder()
df1_int = le.fit_transform(df1['A'])
print(df1_int)

df2_int = le.transform(df2['A'])
print(df2_int)

Result:

[0 0 2 3 1]
[3 3 1 0 0]
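One thing to watch with this approach: column B of df2 contains 'five', which df1 never had. To my knowledge, LabelEncoder.transform raises a ValueError for labels it did not see during fit. A small sketch of that behaviour, where the flag unseen_rejected and the list learned are only there for illustration:

```python
import pandas as pd
from sklearn import preprocessing

df1 = pd.DataFrame({'B': ['one', 'one', 'two', 'three', 'four']})
df2 = pd.DataFrame({'B': ['one', 'five', 'two', 'three', 'four']})

le = preprocessing.LabelEncoder()
le.fit(df1['B'])

# The learned classes are stored sorted; a label's position is its integer code.
learned = [str(c) for c in le.classes_]
print(learned)  # ['four', 'one', 'three', 'two']

try:
    le.transform(df2['B'])  # 'five' was not present when fitting
    unseen_rejected = False
except ValueError as exc:
    unseen_rejected = True
    print(exc)
```

If unseen values are possible, fitting on the union of the values from both DataFrames avoids the error.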

Converting categorical variables to integers with pandas
