Question

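Suppose I have a vocabulary: ['hello', 'how', 'are', 'you']. I have a large corpus of texts, for example: ['hello', 'how', 'how']. Is there an efficient way to encode such a text as a list of integers? For example, if I assign 'hello' = 1, 'how' = 2, 'are' = 3, 'you' = 4, then the text above would be encoded as [1, 2, 2].

My setting: I have to encode a corpus of about 150,000 texts. The vocabulary size is about 200,000. Typically, each text contains fewer than 200 words.

I tried the following code, but it does not seem efficient. Each text takes about 2 seconds, so it would take 8-9 hours to finish.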

tokens_to_index = [[vocabulary.index(word)+1 for word in text] for text in corpus]

Answer 1

Try using a dictionary:

vocabulary = dict(zip(vocabulary, range(1, len(vocabulary)+1)))

def tokens_to_index(corpus):
    return [[vocabulary[word] for word in text] for text in corpus]
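
For reference, here is a self-contained sketch of the same idea using the sample data from the question. The name `word_to_id` is an assumption introduced here so that the original vocabulary list is not overwritten, and `corpus` is assumed to be a list of tokenized texts. The speedup comes from the fact that `list.index` scans the list linearly on every call, whereas a dict lookup is a hash lookup that takes constant time on average.

```python
vocabulary = ['hello', 'how', 'are', 'you']
corpus = [['hello', 'how', 'how']]  # assumed shape: a list of tokenized texts

# Build the word -> integer mapping once; IDs start at 1 as in the question.
word_to_id = {word: i for i, word in enumerate(vocabulary, start=1)}

def tokens_to_index(corpus):
    # Each lookup is a hash lookup rather than a linear scan of the list.
    return [[word_to_id[word] for word in text] for text in corpus]

encoded = tokens_to_index(corpus)
print(encoded)
```

Note that this version raises a KeyError for any word that is not in the vocabulary.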

Answer 2

I'm not sure, but you could try using a dictionary: it lets you store key: value pairs.
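
A minimal sketch of what the key: value approach might look like, with the mapping written out explicitly. The names `word_to_id`, `UNKNOWN_ID`, and `encode` are hypothetical, and reserving 0 for words missing from the vocabulary is an assumption, not something the question specifies.

```python
# Mapping written out as explicit key: value pairs (word -> integer ID).
word_to_id = {'hello': 1, 'how': 2, 'are': 3, 'you': 4}

# Assumption: 0 is reserved for out-of-vocabulary words.
UNKNOWN_ID = 0

def encode(text):
    # dict.get returns the fallback instead of raising KeyError for unseen words.
    return [word_to_id.get(word, UNKNOWN_ID) for word in text]

encoded = encode(['hello', 'how', 'how'])
unknown = encode(['hello', 'world'])
```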

How to efficiently encode a sequence of words as a sequence of integers
