Question

If I have two columns as below:

Origin  Destination  
China   USA  
China   Turkey  
USA     China  
USA     Turkey  
USA     Russia  
Russia  China

How do I perform label encoding so that the labels in the Origin column match those in the Destination column, i.e. get something like:

Origin  Destination  
0   1  
0   3  
1   0  
1   0  
1   0  
2   1

If I encode each column separately, the algorithm will treat China in column 1 as different from China in column 2, which is not the case.
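To make the problem concrete, here is a minimal sketch that rebuilds the example frame and encodes each column on its own. The names `separate`, `origin_usa` and `dest_usa` are only illustrative, and `pd.factorize` stands in for whichever per-column encoder is used.

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Encode every column independently: each column gets its own numbering.
separate = pd.DataFrame({col: pd.factorize(df[col])[0] for col in df.columns})

# Look up which code "USA" received in each column.
origin_usa = separate.loc[df["Origin"] == "USA", "Origin"].unique()
dest_usa = separate.loc[df["Destination"] == "USA", "Destination"].unique()
print(origin_usa, dest_usa)  # the two codes differ
```

The same country ends up with two different integers, which is exactly the inconsistency a shared encoding has to avoid.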

Answer 1

`stack`

df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0

`factorize` and `reshape`

pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0

`np.unique` and `reshape`

pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0
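The outputs above differ only in how codes are numbered: `pd.factorize` assigns codes in order of first appearance, while `np.unique` assigns them in sorted order. Both also return the label array, so the codes can be mapped back. A small sketch, assuming the example frame from the question (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

flat = df.values.ravel()

# factorize: codes follow the order in which labels first appear
codes_f, labels_f = pd.factorize(flat)

# np.unique: codes follow the sorted order of the labels
labels_u, codes_u = np.unique(flat, return_inverse=True)

print(list(labels_f))
print(list(labels_u))

# Indexing the label array with the codes recovers the original values.
decoded = pd.DataFrame(labels_f[codes_f].reshape(df.shape), df.index, df.columns)
```

Either numbering is consistent across both columns; which one to prefer mostly depends on whether a stable, sorted code order matters downstream.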

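Disgusting option

I couldn't stop trying things... sorry!

import itertools

# the default arguments y (dict) and c (counter) persist across calls,
# so a label keeps the same code wherever it appears
df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)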

   Origin  Destination
0       0            1
1       0            3
2       1            0
3       1            3
4       1            2
5       2            0
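For readers who find the mutable-default trick hard to follow, here is a more explicit sketch of the same idea: one shared dictionary hands out the next free integer the first time a label is seen. The helper `encode` and the names `codes` and `counter` are hypothetical, not part of the answer above. As far as I know, recent pandas versions also deprecate `applymap` in favour of `DataFrame.map`, so this version avoids both.

```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

codes = {}                   # label -> integer code, shared by all columns
counter = itertools.count()  # yields 0, 1, 2, ...

def encode(value):
    """Return the code for value, assigning a new one on first sight."""
    if value not in codes:
        codes[value] = next(counter)
    return codes[value]

# Walk the frame column by column, reusing the same dictionary throughout.
encoded = pd.DataFrame(
    {col: [encode(v) for v in df[col]] for col in df.columns},
    index=df.index,
)
print(encoded)
```

Because the frame is traversed one column at a time, codes are numbered in column order (all of Origin first, then Destination), which is why Turkey and Russia get different numbers here than with `stack` or `ravel`.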

As cᴏʟᴅsᴘᴇᴇᴅ pointed out, you can shorten this by assigning back to the dataframe:

df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)

Answer 2

pandas method

You can create a dictionary of {country: value} pairs and map the dataframe to that:

country_map = {country:i for i, country in enumerate(df.stack().unique())}

df['Origin'] = df['Origin'].map(country_map)    
df['Destination'] = df['Destination'].map(country_map)

>>> df
   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0

sklearn method

Since you tagged sklearn, you can use LabelEncoder():

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.stack().unique())

df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])

>>> df
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

To get back the original labels:

>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
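The same fitted encoder can also encode and decode the whole frame in one go. The following is only a sketch under the assumption that every column should share one label set; `encoded`, `decoded` and `mapping` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Fit once on the values of all columns so the classes are shared.
le = LabelEncoder().fit(df.values.ravel())

encoded = df.apply(le.transform)               # labels -> integer codes
decoded = encoded.apply(le.inverse_transform)  # integer codes -> labels

# classes_ holds the sorted labels; a label's position is its code.
mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(mapping)
```

Keeping `le` around (or just `mapping`) is what makes it possible to encode new rows later with the same codes.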

Answer 3

You can use replace:

df.replace(dict(zip(np.unique(df.values),list(range(len(np.unique(df.values)))))))
   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0

Succinct and nice answers from Pir:

df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))

and

df.replace(dict(zip(np.unique(df), itertools.count())))

Answer 4

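Edit: just found out about the return_inverse option of np.unique. No need to search and replace!

# assumes df.values is a writable view of the frame's data (all columns share one dtype)
df.values[:] = np.unique(df, return_inverse=True)[1].reshape(-1,2)

You could also use np.unique together with the vectorized np.searchsorted.

Or you could create an array of one-hot encodings and recover the indices with argmax. Probably not a good idea if there are many countries.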
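The code for the `np.searchsorted` and one-hot variants is not shown here, so the following is only a guess at what they might look like, not the answerer's original code. It assumes the example frame from the question and casts the values to a NumPy string array to keep comparisons simple; `codes_ss`, `codes_oh` and `one_hot` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

values = df.to_numpy().astype(str)
labels = np.unique(values)  # sorted unique labels

# searchsorted: the position of each value in the sorted label array is its code
codes_ss = np.searchsorted(labels, values)

# one-hot: compare every cell with every label, then take the index of the match
one_hot = values[..., None] == labels  # shape (rows, columns, n_labels)
codes_oh = one_hot.argmax(axis=-1)

encoded = pd.DataFrame(codes_ss, df.index, df.columns)
print(encoded)
```

The one-hot array has one boolean per cell per label, so its memory use grows with the number of distinct countries, which matches the caveat above.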

Answer 5

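Using LabelEncoder from sklearn, you can also try:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.values.flatten())

# use transform, not fit_transform: fit_transform would refit the encoder
# on each column separately and give the same country different codes
df = df.apply(le.transform)
print(df)

Result:

   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0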

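If you have more columns and only want to apply this to selected columns of the dataframe, you can try:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# columns to select for encoding
selected_col = ['Origin','Destination']
le.fit(df[selected_col].values.flatten())

# reuse the encoder fitted on all selected columns (transform, not fit_transform)
df[selected_col] = df[selected_col].apply(le.transform)
print(df)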

Label encoding across multiple columns with the same attributes in scikit-learn
