Question

I have a list of strings that I need to convert into a list of numeric labels. Example:

x= ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']
# output should be something like this:
y=[0, 1, 2, 1, 0, 3]

NB: the list has 100,000 strings, and I am reading it from a file.

Answer 1

You can use a dictionary:

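x = ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']
d = {}  # maps each distinct string to its label
count = 0
for i in x:
    if i not in d:
        d[i] = count  # first time this string is seen: assign the next label
        count += 1

new_x = [d[i] for i in x]
print(new_x)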

Output:

[0, 1, 2, 1, 0, 3]
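
The same first-appearance labelling can be wrapped in a small helper. This is only a sketch: the function name `encode_labels` is hypothetical and not part of the answer above. It uses `dict.setdefault` with `len(mapping)` as the next free label instead of a separate counter.

```python
def encode_labels(strings):
    """Label strings 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    labels = []
    for s in strings:
        # len(mapping) is evaluated before any insertion, so for a new
        # string it equals the next unused label; for a known string,
        # setdefault simply returns the label already stored.
        labels.append(mapping.setdefault(s, len(mapping)))
    return labels, mapping


x = ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']
y, mapping = encode_labels(x)
print(y)  # [0, 1, 2, 1, 0, 3]
```

Because the helper accepts any iterable, the stripped lines of a file could be passed in directly instead of a list.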

Answer 2

If the array is large, sklearn has an optimized way to do this with LabelEncoder:

In[124]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
x= ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']
le.fit(x)
y = le.transform(x)
y

Out[124]: array([1, 0, 2, 0, 1, 3], dtype=int64)

This returns a numpy array, which you can operate on further and which is compatible with the rest of the scipy stack.
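
Note that these labels differ from first-appearance order: LabelEncoder numbers the distinct values in sorted order, and uppercase 'John' sorts before lowercase 'hello'. The pure-Python sketch below reproduces that labelling purely as an illustration; it is not how sklearn is implemented internally, and the variable names are my own.

```python
x = ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']

# Distinct values in sorted order: ['John', 'hello', 'hi', 'pumpum']
classes = sorted(set(x))
index = {c: i for i, c in enumerate(classes)}

y = [index[s] for s in x]
print(y)  # [1, 0, 2, 0, 1, 3]

# Map the labels back to the original strings.
restored = [classes[i] for i in y]
```

With the fitted encoder itself, `le.classes_` holds the sorted values and `le.inverse_transform(y)` maps labels back to strings.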

Answer 3

If you are willing to use a third-party library, you can use numpy.unique:

{{1}}
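
One way to do this with numpy.unique is to pass `return_inverse=True`. The following is a sketch under that assumption and not necessarily the code this answer originally showed. The second return value gives, for each item of `x`, its index within the sorted distinct values.

```python
import numpy as np

x = ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']

# classes: sorted distinct values
# y: position of each item of x within classes
classes, y = np.unique(x, return_inverse=True)

print(classes)  # ['John' 'hello' 'hi' 'pumpum']
print(y)        # [1 0 2 0 1 3]
```

As with LabelEncoder, the labels follow sorted order, not the order of first appearance.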

Answer 4

Here is a short solution with an intermediate dictionary:

x = ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']

d = dict(zip(set(x),range(len(set(x)))))
y = [d[i] for i in x]

print(y)  # e.g. [2, 1, 0, 1, 2, 3]; the exact labels depend on set iteration order

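Note: you can use this if you do not need the numeric labels to be ordered, i.e. you do not need 0 to be associated with the first item of x, 1 with the second distinct item, and so on.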

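Edit after Patrick Artner's comment:
He suggested computing the set once and storing it in its own variable as an optimization, and he is right. Here is the updated code: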

x = ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']

s = set(x)
d = dict(zip(s,range(len(s))))
y = [d[i] for i in x]

print(y)  # e.g. [2, 1, 0, 1, 2, 3]; the exact labels depend on set iteration order
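
If first-appearance order does matter, a similar one-liner can be built on `dict.fromkeys` instead of `set`. This is a sketch of an alternative and not part of the answer above; it relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7.

```python
x = ['hello', 'John', 'hi', 'John', 'hello', 'pumpum']

# dict.fromkeys keeps one key per distinct string, in first-appearance order;
# enumerate then numbers those keys 0, 1, 2, ...
d = {s: i for i, s in enumerate(dict.fromkeys(x))}
y = [d[s] for s in x]

print(y)  # [0, 1, 2, 1, 0, 3]
```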

List of strings to numeric labels
