Question

I have a Pandas::Series object containing repeated string values, which I need to normalize into int values to feed to TensorFlow.

I have converted it to Category following this, but it creates a code for each item instead of recognizing the duplicates.

E.g., I would like the following conversion:

['a', 'b', 'c', 'd', 'a', 'a', 'c'] -> [1, 2, 3, 4, 1, 1, 3]

Answer 1

You need factorize with a small change: its codes start at 0, so add 1:

import pandas as pd
print((pd.factorize(['a', 'b', 'c', 'd', 'a', 'a', 'c'])[0] + 1).tolist())
[1, 2, 3, 4, 1, 1, 3]
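
For reference, a minimal sketch of what factorize returns, assuming pandas is imported as pd. The variable names (values, codes, uniques, ids, decoded) are illustrative and not part of the original answer.

```python
import pandas as pd

values = pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c'])

# factorize returns a pair: 0-based integer codes (numbered in order of
# first appearance) and the unique values those codes refer to.
codes, uniques = pd.factorize(values)

# Shift to 1-based ids, as in the question.
ids = codes + 1
print(ids.tolist())  # [1, 2, 3, 4, 1, 1, 3]

# uniques lets you map the ids back to the original strings.
decoded = uniques[ids - 1]
print(list(decoded))  # ['a', 'b', 'c', 'd', 'a', 'a', 'c']
```

Note that with default settings factorize marks missing values with the code -1, so after adding 1 they would end up as 0.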

Answer 2

After converting to category, you need to use cat.codes:

pd.Series(['a', 'b', 'c', 'd', 'a', 'a', 'c']).astype('category').cat.codes+1
Out[1407]: 
0    1
1    2
2    3
3    4
4    1
5    1
6    3
dtype: int8
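
The two answers give the same result here only because the example data already appears in alphabetical order. A small sketch of the difference, assuming pandas is imported as pd (the input list and variable names are made up for illustration): factorize numbers values in order of first appearance, while cat.codes numbers them by the sorted order of the categories.

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])

# factorize: ids follow the order in which values first appear.
by_appearance = (pd.factorize(s)[0] + 1).tolist()
print(by_appearance)  # [1, 2, 1, 3]

# cat.codes: ids follow the sorted order of the categories ('a' < 'b' < 'c').
by_sorted = (s.astype('category').cat.codes + 1).tolist()
print(by_sorted)  # [2, 1, 2, 3]
```

Either mapping identifies the duplicates consistently; which one to use depends on whether the ids should reflect appearance order or sorted order.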

Pandas String Series to int normalization for Tensor
