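I am trying to create a one-hot encoding (ohe) for a list of characters that allows for unobserved levels. Using the answers to Convert array of indices to 1-hot encoded numpy array and Finding the index of an item given a list containing it in Python, I came up with the following: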
import numpy as np

# example data
# this is the full list including unobserved levels
av = list(map(chr, range(ord('a'), ord('z')+1)))
# this is the vector to apply ohe
v = ['a', 'f', 'u']
# apply one hot encoding
ohe = np.zeros((len(v), len(av)))
for i in range(len(v)):
    ohe[i, av.index(v[i])] = 1
ohe
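Is there a more standard or faster way to do this? Note that the second link above mentions that .index() is a bottleneck.
(The scale of my problem: the full vector (av) has about 1000 levels, and the vector to encode (v) has length 0.5M.) Thanks.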
Answer 0: (score: 1)
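You can use a lookup dictionary, lookup, that maps each level in av to its position. .index has O(n) complexity, compared with O(1) for a dictionary lookup. You can even avoid the explicit for loop by doing the following:
indices = [lookup[vi] for vi in v]
ohe = np.zeros((len(v), len(av)))
ohe[np.arange(len(v)), indices] = 1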
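The lookup-dictionary approach described in this answer can be sketched end to end as follows. This is a minimal sketch, not the answerer's exact code: the helper name one_hot, the dictionary comprehension that builds lookup, and the uint8 dtype are assumptions made for illustration.

```python
import numpy as np

def one_hot(values, levels, dtype=np.uint8):
    """One-hot encode `values` against the full list `levels` (hypothetical helper)."""
    # Build the level -> column lookup once; each later lookup is O(1) on average,
    # whereas list.index scans the list on every call.
    lookup = {level: i for i, level in enumerate(levels)}
    # Translate every value to its column index (raises KeyError for an unknown value).
    indices = np.fromiter((lookup[x] for x in values), dtype=np.intp, count=len(values))
    ohe = np.zeros((len(values), len(levels)), dtype=dtype)
    # Fancy indexing sets exactly one cell per row without a Python-level loop.
    ohe[np.arange(len(values)), indices] = 1
    return ohe

# Same example data as in the question.
av = [chr(c) for c in range(ord('a'), ord('z') + 1)]
v = ['a', 'f', 'u']
ohe = one_hot(v, av)
```

At the scale mentioned in the question (0.5M rows by about 1000 levels), a dense float64 array would take about 4 GB, which is why this sketch defaults to a one-byte dtype. Pass dtype=float if floating-point output is needed.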
import ast, operator
def arithmetic_eval(s):
    # map supported AST operator nodes to their implementations
    binOps = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
        ast.Mod: operator.mod
    }
    unOps = {
        ast.USub: operator.neg
    }
    node = ast.parse(s, mode='eval')

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        elif isinstance(node, ast.Constant) and isinstance(node.value, (int, float, complex, str)):
            # ast.Constant replaces the deprecated ast.Num and ast.Str
            return node.value
        elif isinstance(node, ast.BinOp):
            return binOps[type(node.op)](_eval(node.left), _eval(node.right))
        elif isinstance(node, ast.UnaryOp):
            return unOps[type(node.op)](_eval(node.operand))
        else:
            raise Exception('Unsupported type {}'.format(node))

    return _eval(node.body)