Question

I am trying to represent a set of words as numbers. So far I have this code:

from sklearn.preprocessing import OneHotEncoder
import itertools
docs = ["select", "max", "income", "from", "data", "where", "revenue", "between", "20", "40"]

# split documents to tokens
tokens_docs = [doc.split(" ") for doc in docs]

# convert list of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {A: 1, B: 2} here
all_tokens = itertools.chain.from_iterable(tokens_docs)
word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}

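However, there is one constraint: when a token is itself already a number, I need to assign it that same number as its value in the word_to_id dictionary. Any suggestions?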

Answer 1

You can add a condition inside the dict comprehension. To keep it short, use Python's conditional expression: what_if_True if condition else what_if_False. Like this:

word_to_id = {token: token if token.isdigit() else idx for idx, token in enumerate(set(all_tokens))}
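A minimal runnable sketch of that expression, using the token list from the question. Note that `token if token.isdigit()` keeps digit tokens as strings; wrapping them in `int(...)` is an assumption about what "the same value as the number" should mean. The names `as_str` and `as_int` are only for illustration.

```python
docs = ["select", "max", "income", "from", "data", "where", "revenue", "between", "20", "40"]

# sorted() makes the index assignment reproducible; plain set() order is arbitrary
tokens = sorted(set(docs))

# value_if_true if condition else value_if_false
as_str = {t: t if t.isdigit() else i for i, t in enumerate(tokens)}
as_int = {t: int(t) if t.isdigit() else i for i, t in enumerate(tokens)}

print(type(as_str["20"]).__name__)  # digit token stored as a string
print(type(as_int["20"]).__name__)  # digit token stored as an integer
```

Which of the two you want depends on whether the ids are later used as numbers or only as dictionary values.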

Answer 2

You can use an if/else conditional expression inside a dictionary comprehension.

{token: idx if not token.isdigit() else int(token)
             for idx, token in enumerate(set(all_tokens))}

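For the input ['fd', 'df', '5', 'fg', 'dfg', '4'] this can return, for example,
{'4': 4, '5': 5, 'df': 1, 'dfg': 4, 'fd': 0, 'fg': 3}
(set iteration order is not guaranteed, so the indices may differ between runs).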
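In the sample output above, 'dfg' and '4' both map to 4, because an enumerate index can coincide with a numeric token. If the ids need to be unique, one possible approach (an assumption, not something either answer states) is to reserve the numeric values first and then give the remaining words the smallest unused integers. `build_word_to_id` is a hypothetical helper name.

```python
from itertools import count

def build_word_to_id(tokens):
    """Map digit tokens to their own value and other tokens to unused integers."""
    unique = sorted(set(tokens))  # sorted for a reproducible result
    # digit tokens keep their numeric value
    word_to_id = {t: int(t) for t in unique if t.isdigit()}
    reserved = set(word_to_id.values())
    # lazily yield 0, 1, 2, ... skipping values already taken by digit tokens
    free_ids = (i for i in count() if i not in reserved)
    for t in unique:
        if t not in word_to_id:
            word_to_id[t] = next(free_ids)
    return word_to_id

print(build_word_to_id(['fd', 'df', '5', 'fg', 'dfg', '4']))
```

This sketch assumes plain ASCII digit tokens. Tokens such as "007" and "7" would still share the id 7, so handle that case separately if it can occur in your data.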

Python word-to-id representation
