One-hot encoding a list of characters with unobserved levels

Time: 2019-10-14 15:04:54

Tags: python one-hot-encoding

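I am trying to create a one-hot encoding (ohe) for a list of characters in a way that allows for unobserved levels. Using the answers to "Convert array of indices to 1-hot encoded numpy array" and "Finding the index of an item given a list containing it in Python", I arrived at the following, which does what I want: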

import numpy as np

# example data
# this is the full list, including unobserved levels
av = list(map(chr, range(ord('a'), ord('z') + 1)))
# this is the vector to one-hot encode
v = ['a', 'f', 'u']

# apply one-hot encoding
ohe = np.zeros((len(v), len(av)))
for i in range(len(v)):
    ohe[i, av.index(v[i])] = 1
ohe

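Is there a more standard or faster way to do this? Note that the second link above mentions that .index() is a bottleneck.

(The scale of my problem: the full vector (av) has about 1000 levels, and the vector of values to encode (v) has length 0.5M.) Thanks.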

1 answer:

Answer 0: (score: 1)

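You can use a lookup dictionary that maps each level in av to its position.

The complexity of .index is O(n), whereas looking up a key in a dictionary is O(1). You can even avoid the for loop by doing the following, where lookup is that dictionary:

indices = [lookup[vi] for vi in v]
ohe = np.zeros((len(v), len(av)))
ohe[np.arange(len(v)), indices] = 1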
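The lookup-dictionary idea can be put together as a small self-contained sketch. The function name `one_hot`, the dictionary comprehension used to build `lookup`, and the `dtype` parameter are assumptions made for illustration; they are not taken from the answer itself.

```python
import numpy as np


def one_hot(values, levels, dtype=np.uint8):
    """One-hot encode `values` against the full list of `levels`.

    Hypothetical helper: the name and the dtype default are illustrative.
    """
    # Build the level -> column index mapping once; each later lookup is
    # O(1) on average, instead of the O(n) scan done by list.index.
    lookup = {level: i for i, level in enumerate(levels)}

    # Raises KeyError if a value is not one of the declared levels.
    indices = np.fromiter(
        (lookup[x] for x in values), dtype=np.intp, count=len(values)
    )

    # One row per value, one column per level (observed or not).
    ohe = np.zeros((len(values), len(levels)), dtype=dtype)
    # Integer array indexing sets exactly one entry per row.
    ohe[np.arange(len(values)), indices] = 1
    return ohe


av = list(map(chr, range(ord('a'), ord('z') + 1)))
v = ['a', 'f', 'u']
ohe = one_hot(v, av)
print(ohe.shape)  # (3, 26)
```

At the scale mentioned in the question (about 0.5M rows by 1000 levels), a dense float64 array, which is what np.zeros produces by default, would need roughly 4 GB, while uint8 needs roughly 0.5 GB. That is why this sketch defaults to a small integer dtype. Whether a dense array is appropriate at all depends on what consumes the result.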
