Question

Given the vocabulary ["NY", "LA", "GA"], how can I encode it as:

"NY" = 100
"LA" = 010
"GA" = 001

So if I look up "NY GA", I get 101.

Answer 1

vocab = ["NY", "LA", "GA"]
selectedVocabs = 'NY GA'
# Work on a list of characters, because Python strings are immutable
bits = ['0'] * len(vocab)
for sel in selectedVocabs.split():
    bits[vocab.index(sel)] = '1'
categorystring = ''.join(bits)  # '101'

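This is what I ended up with after testing. It turns out Python does not support item assignment on strings; for some reason I thought it did.

Personally, I think behzad's solution is better: numpy handles this more cleanly and faster.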
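
To make the point about item assignment concrete, here is a minimal sketch that wraps the same idea in a function. The name `encode` is made up for this illustration, and the sketch assumes every selected word is present in the vocabulary.

```python
def encode(selected, vocab):
    """Return a '0'/'1' string with a '1' at the position of each selected word."""
    bits = ['0'] * len(vocab)          # a list is mutable, unlike a str
    for word in selected.split():
        bits[vocab.index(word)] = '1'  # raises ValueError for unknown words
    return ''.join(bits)


vocab = ["NY", "LA", "GA"]
print(encode("NY GA", vocab))  # 101

# Strings are immutable, so direct item assignment fails:
try:
    s = '000'
    s[0] = '1'
except TypeError as exc:
    print("TypeError:", exc)
```

Note that `vocab.index` scans the list on every call, which is fine for a handful of words but may become slow for a large vocabulary.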

Answer 2

You can use numpy.in1d (numpy.isin is the newer equivalent in recent NumPy releases):

>>> import numpy as np
>>> xs = np.array(["NY", "LA", "GA"])
>>> ''.join('1' if f else '0' for f in np.in1d(xs, 'NY GA'.split(' ')))
'101'

Or:

>>> ''.join(np.where(np.in1d(xs, 'NY GA'.split(' ')), '1', '0'))
'101'
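
For comparison, the same membership test can be sketched without numpy. This is only an illustration of the idea behind `in1d` (mark each vocabulary entry by whether it appears in the query); the variable names `selected` and `encoded` are not from the answer above.

```python
vocab = ["NY", "LA", "GA"]
selected = set("NY GA".split())   # a set gives fast membership tests

# One character per vocabulary entry, in vocabulary order
encoded = ''.join('1' if word in selected else '0' for word in vocab)
print(encoded)  # 101
```

For a three-word vocabulary the difference is negligible; numpy is more likely to pay off when the vocabulary is already a large array.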

Answer 3

Or you could do this:

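    vocabulary = ["NY", "LA", "GA"]

    # Assign each word a decimal place value: NY -> 100, LA -> 10, GA -> 1
    i = pow(10, len(vocabulary) - 1)
    dictVocab = dict()

    for word in vocabulary:
        dictVocab[word] = i
        i //= 10  # integer division; "/" would produce floats on Python 3

    yourStr = "NY LA"
    result = 0
    for word in yourStr.split():
        result += dictVocab[word]  # result ends up as 110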
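
One thing to watch with this approach is that the result is an integer, so leading zeros are lost ("GA" alone gives 1, not 001). A minimal sketch of one way to handle that, assuming a string result is acceptable; the helper name `lookup` is hypothetical and not part of the answer above.

```python
vocabulary = ["NY", "LA", "GA"]

# Same decimal place-value table as above: NY -> 100, LA -> 10, GA -> 1
dictVocab = {word: 10 ** (len(vocabulary) - 1 - k)
             for k, word in enumerate(vocabulary)}


def lookup(yourStr):
    # Hypothetical helper: sum the place values, then pad to full width.
    # set() drops repeated words so no digit can exceed 1.
    result = sum(dictVocab[word] for word in set(yourStr.split()))
    return str(result).zfill(len(vocabulary))


print(lookup("NY LA"))  # 110
print(lookup("GA"))     # 001
```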

Answer 4

Another solution using numpy. It looks like you want a binary encoding of the vocabulary, so the code below feels natural to me.

import numpy as np

def to_binary_representation(your_str="NY LA"):
    xs = np.array(["NY", "LA", "GA"])
    ys = 2**np.arange(3)[::-1]
    lookup_table = dict(zip(xs,ys))

    return bin(np.sum([lookup_table[k] for k in your_str.split()]))

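This doesn't have to be done in numpy, but numpy may be faster if you are working with large arrays. np.sum can be replaced with the built-in sum, and xs and ys can be converted to their non-numpy equivalents.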
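
A sketch of what that pure-Python variant might look like. The `vocab` parameter is an addition for this illustration (the numpy version hard-codes the three words), and the return value keeps the `0b` prefix that `bin` produces.

```python
def to_binary_representation(your_str="NY LA", vocab=("NY", "LA", "GA")):
    # Highest bit goes to the first vocabulary entry: NY -> 4, LA -> 2, GA -> 1
    ys = [2 ** i for i in reversed(range(len(vocab)))]
    lookup_table = dict(zip(vocab, ys))
    # Built-in sum replaces np.sum; bin() returns a string such as '0b110'
    return bin(sum(lookup_table[k] for k in your_str.split()))


print(to_binary_representation())         # 0b110
print(to_binary_representation("NY GA"))  # 0b101
```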

Answer 5

To create the lookup dictionary, reverse the vocabulary, enumerate it, and use powers of 2:

>>> vocabulary = ["NY", "LA", "GA"]
>>> d = dict((word, 2 ** i) for i, word in enumerate(reversed(vocabulary)))
>>> d
{'NY': 4, 'GA': 1, 'LA': 2}

Query the dictionary:

>>> query = "NY GA"
>>> sum(code for word, code in d.items() if word in query.split())
5

If you want it formatted as binary:

>>> '{0:b}'.format(5)
'101'

Edit: if you want a one-liner:

>>> '{0:b}'.format(
...     sum(2 ** i
...         for i, word in enumerate(reversed(vocabulary))
...         if word in query.split()))
'101'

Edit 2: if you want padding, for example to six digits:

>>> '{0:06b}'.format(5)
'000101'
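
Putting these pieces together, a minimal sketch that pads to the vocabulary length rather than a fixed width. The function name `encode_bits` is hypothetical, and using bitwise OR instead of a sum is a small change from the snippets above so that a word repeated in the query is not counted twice.

```python
def encode_bits(query, vocabulary):
    # Hypothetical helper: first vocabulary word gets the highest bit
    d = {word: 1 << i for i, word in enumerate(reversed(vocabulary))}
    code = 0
    for word in query.split():
        code |= d[word]   # OR instead of +, so repeated words are harmless
    # Nested format field: pad with zeros to len(vocabulary) binary digits
    return '{0:0{1}b}'.format(code, len(vocabulary))


vocabulary = ["NY", "LA", "GA"]
print(encode_bits("NY GA", vocabulary))  # 101
print(encode_bits("GA", vocabulary))     # 001
```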

How to encode categorical values in Python

5 answers: