Question

This question about building a dictionary from strings leans more toward language/NLP than "Creating a dictionary from a string" does.

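Given a list of string sentences, is there a simpler way to build a dictionary of unique words and then vectorize the sentences? I know there are external libraries that can do this, such as gensim, but I want to avoid them. I have been doing it this way: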

Answer 1

Like this:

from itertools import chain, count

s1 = "this is is a foo"
s2 = "this is a a bar"
s3 = "that 's a foobar"

# convert each sentence into a list of words, because the lists
# will be used twice, to build the dictionary and to vectorize
w1, w2, w3 = all_ws = [s.split() for s in [s1, s2, s3]]

# chain the lists and turn into a set, and then a list, of unique words
index_to_word = list(set(chain(*all_ws)))

# build the inverse mapping of index_to_word, by pairing it with a counter
word_to_index = dict(zip(index_to_word, count()))

# create the vectors of word indices and of word count for each sentence
v1 = [(word_to_index[word], w1.count(word)) for word in w1]
v2 = [(word_to_index[word], w2.count(word)) for word in w2]
v3 = [(word_to_index[word], w3.count(word)) for word in w3]

print(v1)
print(v2)
print(v3)

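Things to keep in mind:

Dictionaries should only go one way, from key to value; if you need the opposite direction, create (and keep updated) two dictionaries, one being the reverse mapping of the other, as above;
If you need a dictionary whose keys are consecutive integers, just use a list (thanks, Jeff);
Never compute the same thing twice! (See the split() versions of the sentences.) Save it in a variable if you need it later;
Use list comprehensions wherever possible, for performance, terseness and readability.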

Answer 2

If you are trying to count the occurrences of words in a sentence, use collections.Counter.
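A minimal sketch of the Counter approach, assuming plain whitespace tokenization with split() as in the question; the variable names are illustrative only.

```python
from collections import Counter

sentence = "this is is a foo"

# Counter maps each distinct word to the number of times it occurs
word_counts = Counter(sentence.split())

print(word_counts["is"])   # 2
print(word_counts["bar"])  # 0: a missing word counts as zero, no KeyError
```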

There are problems with your code:

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?
dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

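The part above just creates a dictionary indexed by arbitrary numbers (they come from iterating over a set, which has no concept of an index). You then access that dictionary with the following function: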

def getKey(dic, value):
  return [k for k,v in sorted(dic.items()) if v == value]

This function also completely ignores the spirit of a dict: you look things up by key, not by value.
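One way to keep the lookups key-based is to store the words as keys and their positions as values. A sketch under that assumption: word_to_index is a hypothetical name, and the unique words are sorted only so that the index assignment is reproducible.

```python
s1 = "this is is a foo"
s2 = "this is a a bar"
s3 = "that 's a foobar"

# sort the unique words so that the index assignment is deterministic
uniq = sorted(set(" ".join([s1, s2, s3]).split()))

# words are the keys, so finding a word's position is a plain dict lookup
word_to_index = {word: idx for idx, word in enumerate(uniq)}

print(word_to_index["foo"])
```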

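Also, the idea behind vectorize is unclear. What are you trying to achieve with this function? You asked for a simpler version of vectorize without telling us what it does.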

Answer 3

There are multiple questions in your code, so let's answer them one by one.

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?

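First, it may be conceptually simpler (though equally verbose) to split() the strings independently, rather than joining them together and then splitting the result.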

uniq = list(set(chain(*map(str.split, (s1, s2, s3)))))

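Beyond that: it looks like you always work with the word lists, not the actual sentences, so you end up splitting in multiple places. Why not split them all once, at the top?

While you're at it, why not put the sentences in a collection instead of passing s1, s2 and s3 around explicitly? You can put the results in a collection as well.

So: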

sentences = (s1, s2, s3)
wordlists = [sentence.split() for sentence in sentences]

uniq = list(set(chain.from_iterable(wordlists)))

# ...

vectors = [vectorize(sentence, dictionary) for sentence in sentences]
for vector in vectors:
    print(vector)

dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

You can do it with dict() over a list comprehension, but more simply, use a dict comprehension. And while you're at it, use enumerate in place of the for i in range(len(uniq)) bit.

dictionary = {idx: word for (idx, word) in enumerate(uniq)}

This replaces the whole # ... part above.
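For comparison, the spellings mentioned above all build the same mapping. A short sketch with a made-up uniq list (the names d1, d2 and d3 are illustrative):

```python
uniq = ["a", "bar", "foo"]

# dict() over a list comprehension of (key, value) pairs
d1 = dict([(idx, word) for idx, word in enumerate(uniq)])

# dict comprehension
d2 = {idx: word for idx, word in enumerate(uniq)}

# enumerate already yields (index, item) pairs, so dict() accepts it directly
d3 = dict(enumerate(uniq))

print(d1 == d2 == d3)  # True
```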

Meanwhile, if you want a reverse dictionary lookup, this is not the way to do it:

def getKey(dic, value):
    return [k for k,v in sorted(dic.items()) if v == value]

Instead, create an inverse dictionary that maps each value to a list of keys.

from collections import defaultdict

def invert_dict(dic):
    # map each value to the list of keys that point to it
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

Then, instead of your getKey function, just do a normal lookup in the inverted dictionary.
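A small usage sketch of that idea, with a made-up three-word dictionary; the helper is repeated here only so the snippet runs on its own.

```python
from collections import defaultdict

def invert_dict(dic):
    # map each value to the list of keys that carry it
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

dictionary = {0: "a", 1: "bar", 2: "foo"}
inverse = invert_dict(dictionary)

# the equivalent of getKey(dictionary, "foo")[0], without scanning every item
print(inverse["foo"][0])  # 2
```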

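If you need to alternate between modifications and lookups, you probably want some kind of bidirectional dictionary that manages its own inverse. There are a number of recipes for this on ActiveState, and there may be some modules on PyPI, but it is not hard to build one yourself. In any case, you don't seem to need that here.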
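As a rough illustration of what building one yourself could look like, here is a sketch of a two-way mapping. BiDict and key_for are hypothetical names, not taken from any recipe or package, and the sketch assumes that keys and values are both unique and hashable.

```python
class BiDict(object):
    """Hypothetical two-way mapping that keeps its own inverse up to date."""

    def __init__(self):
        self.forward = {}
        self.inverse = {}

    def __setitem__(self, key, value):
        # drop stale pairs so both directions stay consistent
        if key in self.forward:
            del self.inverse[self.forward[key]]
        if value in self.inverse:
            del self.forward[self.inverse[value]]
        self.forward[key] = value
        self.inverse[value] = key

    def __getitem__(self, key):
        return self.forward[key]

    def key_for(self, value):
        return self.inverse[value]

bd = BiDict()
bd[0] = "foo"
bd[1] = "bar"
bd[0] = "baz"  # re-assigning key 0 also removes "foo" from the inverse

print(bd[0], bd.key_for("bar"))
```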

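Finally, there is your vectorize function.

As mentioned above, the first thing to do is to take a word list rather than a sentence that has to be split.

And there is no reason to re-split the sentence after lower; just use map or a generator expression over the word list.

In fact, I'm not sure why you are calling lower here at all, when your dictionary was built from the original-case words. My guess is that this is a bug, and that you want lower when building the dictionary too. That is one of the advantages of building the word lists up front in a single, easy-to-find place: you only need to change one line: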

wordlists = [sentence.lower().split() for sentence in sentences]

Now things are already a bit simpler:

def vectorize(wordlist, dictionary):
    vector = []
    for word in wordlist:
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        vector.append((dic_pos,word_count))
    return vector

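Meanwhile, you may notice that vector = []… for word in wordlist… vector.append is exactly what a list comprehension is for. But how do you turn three lines of code into a list comprehension? Easy: refactor them into a function. So: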

def vectorize(wordlist, dictionary):
    def vectorize_word(word):
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        return (dic_pos,word_count)
    return [vectorize_word(word) for word in wordlist]
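Note that getKey still scans the whole dictionary for every word, and wordlist.count rescans the word list each time. The following is a sketch of one way to avoid both, not the answerer's code: it assumes a word-to-index mapping (word_to_index, a hypothetical name) and collections.Counter, and it keeps the output shape of the versions above, with one tuple per word occurrence.

```python
from collections import Counter

def vectorize(wordlist, word_to_index):
    # count every word once instead of calling wordlist.count() per word
    counts = Counter(wordlist)
    # one (index, count) pair per word occurrence, as in the versions above
    return [(word_to_index[word], counts[word]) for word in wordlist]

sentences = ("this is is a foo", "this is a a bar", "that 's a foobar")
wordlists = [sentence.lower().split() for sentence in sentences]

# sorted only so that the indices are reproducible between runs
uniq = sorted(set(word for wordlist in wordlists for word in wordlist))
word_to_index = {word: idx for idx, word in enumerate(uniq)}

vectors = [vectorize(wordlist, word_to_index) for wordlist in wordlists]
print(vectors[0])
```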

Answer 4

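OK, it looks like you want to:

Return a dictionary with the position value of each token.
Count the number of times each token occurs in the collection.

You can: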

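import bisect

uniq.sort()  # sort it, since order didn't seem to matter

def getPosition(value):
    position = bisect.bisect_left(uniq, value)  # O(log n) binary search
    # bisect_left returns len(uniq) when value is greater than every element
    if position == len(uniq) or uniq[position] != value:
        raise IndexError(value)
    return position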

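To search in O(1) time, you can create a hash-based dictionary and insert each value iteratively with a sequential key. This is much less efficient in memory, but it gives O(1) average-time lookups through hashing... While I was writing this, Tobia posted a good code example, so take a look at that answer.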
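A minimal sketch of that hash-based alternative, assuming a plain dict and a made-up token list; position_of is a hypothetical name.

```python
uniq = ["that", "foo", "bar", "this"]

# insert each value with a sequential integer as its position
position_of = {}
for value in uniq:
    if value not in position_of:
        position_of[value] = len(position_of)

# average-case constant-time lookup by hashing the token
print(position_of["bar"])  # 2
```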
