Question

Consider the following DataFrame:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
df = df.apply(LabelEncoder().fit_transform)
print(df)

Current output:

   a  b  c
0  0  1  0
1  1  0  0

My goal is to pass in the columns that should share categorical values and get output like this:

   a  b  c
0  0  1  2
1  1  0  2

Answer 1

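Pass axis=1 to call LabelEncoder().fit_transform once for each row. (By default, df.apply(func) calls func once for each column.) Note that the encoder is refit on every row, so the codes agree across rows only when each row contains the same set of values, as it does here.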

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], 
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

encoder = LabelEncoder()

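# result_type="broadcast" keeps the original shape and column labels; without it,
# recent pandas versions return a Series of arrays rather than a DataFrame
df = df.apply(encoder.fit_transform, axis=1, result_type="broadcast")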
print(df)

yields

   a  b  c
0  1  2  0
1  2  1  0

Alternatively, you can convert the data to the category dtype and use the category codes as labels:

import pandas as pd

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], 
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

stacked = df.stack().astype('category')
result = stacked.cat.codes.unstack()
print(result)

which also yields

   a  b  c
0  1  2  0
1  2  1  0

This should be significantly faster, since it does not call encoder.fit_transform once per row (which can perform terribly if you have many rows).
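Speed aside, the two approaches differ once the rows stop containing the same set of values: a per-row fit builds a separate mapping for each row, while the stacked category codes share one mapping. Here is a minimal sketch of that difference. The `Spain` value and the two-column frame are made up for illustration and are not part of the question's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: the two rows do not contain the same set of values.
df = pd.DataFrame([["France", "Italy"], ["Italy", "Spain"]], columns=["a", "b"])

# Row-wise fitting: every row gets its own, unrelated mapping.
row_codes = [LabelEncoder().fit_transform(row).tolist() for row in df.to_numpy()]

# Stacked category codes: a single mapping shared by the whole frame.
shared = df.stack().astype("category").cat.codes.unstack()

print(row_codes)  # "Italy" is 1 in the first row but 0 in the second
print(shared)     # "Italy" is 1 everywhere
```

So if the rows can differ, the stacked version is the one that actually gives shared labels.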

Answer 2

You can do this with pd.factorize.

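# use a separate name so the original df is not overwritten
stacked = df.stack()                    # one value per (row, column) pair
stacked[:] = pd.factorize(stacked)[0]   # codes in order of first appearance
stacked.unstack()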

   a  b  c
0  0  1  2
1  1  0  2

If you only want to encode some of the columns in the DataFrame, then:

# df here is the original (unstacked) DataFrame
temp = df[['a', 'b']].stack()
temp[:] = temp.factorize()[0]
df[['a', 'b']] = temp.unstack()

   a  b        c
0  0  1  Belgium
1  1  0  Belgium
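pd.factorize also returns the unique values alongside the codes, which is useful if you need to decode later. Below is a rough sketch that wraps the partial-column version in a helper. The name `encode_shared` is hypothetical, and it assumes the selected columns contain no missing values.

```python
import pandas as pd

def encode_shared(df, cols):
    """Return a copy of df with `cols` replaced by shared integer codes,
    plus the unique labels (position in `labels` == code)."""
    stacked = df[cols].stack()
    codes, labels = pd.factorize(stacked)  # codes in order of first appearance
    out = df.copy()
    # rebuild the wide layout and keep the requested column order
    out[cols] = pd.Series(codes, index=stacked.index).unstack()[cols]
    return out, labels

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

encoded, labels = encode_shared(df, ["a", "b"])
print(encoded)

# decode column "a" by indexing the labels with its codes
decoded_a = list(labels[encoded["a"].to_numpy()])
print(decoded_a)
```

The original df is left untouched, and column c passes through unchanged.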

Answer 3

If the encoding order doesn't matter, you can do the following:

df_new = (         
    pd.DataFrame(columns=df.columns,
                 data=LabelEncoder()
                 .fit_transform(df.values.flatten()).reshape(df.shape))
)

df_new
Out[27]: 
   a  b  c
0  1  2  0
1  2  1  0
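If you keep a reference to the fitted encoder instead of creating it inline, you can also inspect the mapping and decode the result afterwards. This is a small sketch along those lines. The variable names are mine, while `classes_` and `inverse_transform` are the standard LabelEncoder members.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

encoder = LabelEncoder()
# fit once on all values, flattened to 1-D
codes = encoder.fit_transform(df.to_numpy().ravel())
df_new = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)

# classes_ holds the sorted labels; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform maps the codes back to the original labels
decoded = pd.DataFrame(encoder.inverse_transform(codes).reshape(df.shape),
                       index=df.index, columns=df.columns)
print(mapping)
print(decoded)
```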

Answer 4

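Here is an alternative solution using categorical data. It is similar to @unutbu's answer but preserves the order of factorization. In other words, the first value found gets code 0.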

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]],
                  columns=["a", "b", "c"])

# get unique values in order
vals = df.T.stack().unique()

# convert to categories and then extract codes
for col in df:
    df[col] = pd.Categorical(df[col], categories=vals)
    df[col] = df[col].cat.codes

print(df)

   a  b  c
0  0  1  2
1  1  0  2
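The same idea can be written without the explicit loop by building one CategoricalDtype and casting every column to it. This is a sketch of that variant and not a drop-in replacement. The names `shared_dtype` and `codes` are mine.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]],
                  columns=["a", "b", "c"])

# unique values in order of first appearance, going column by column
vals = df.T.stack().unique()

# one dtype shared by every column, so all columns use the same codes
shared_dtype = CategoricalDtype(categories=vals)
codes = df.astype(shared_dtype).apply(lambda s: s.cat.codes)
print(codes)
```

Any value that is not in `vals` would get code -1, which is how pandas marks missing categories.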

Label encoding multiple columns that share the same categories