Question

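I am trying to build an inverted index, i.e. a mapping from each word to the document it came from, identified by that document's position in the list of documents.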

In my case, the input has already been parsed into a list of lists (one inner list of words per document).

My input looks like this:

        [
        ['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
        ['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
        ['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'], 
        ['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
        ['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
        ]

This is my code:

def create_inverted(mylists):
    myDict = {}
    for sublist in mylists: 
        for i in range(len(sublist)):
            if sublist[i] in myDict:
                myDict[sublist[i]].append(i)
            else:
                myDict[sublist[i]] = [i]

    return myDict

It does build a dictionary, but when I run a search I don't get the right results. This is what I am trying to do:

documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

index = {'owl': [0, 2],
         'lion': [0, 1],  # IDs are sorted. 
         'deer': [1],
         'leopard': [2]}

def indexed_search(documents, index, query):
   return [documents[doc_id] for doc_id in index[query]]

print(indexed_search(documents, index, 'lion'))

I want to be able to enter a search term and get back the IDs of the lists that contain it.

Any ideas?

Answer 1

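You are mapping each word to the positions where it was found within each document, rather than to the documents it was found in. You should store indexes into the list of documents instead of indexes into the documents themselves, or perhaps just map words directly to documents rather than to indexes: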

def create_inverted_index(documents):
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):
            if word in index:
                index[word].append(i)
            else:
                index[word] = [i]
    return index

Most of this is the same as your code. The main difference is in these two lines:

    for i, document in enumerate(documents):
        for word in set(document):

which correspond to the following part of your code:

    for sublist in mylists: 
        for i in range(len(sublist)):

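enumerate iterates over both the indexes and the elements of a sequence. Since enumerate is in the outer loop, i in my code is the index of the document, whereas i in your code is the index of a word within a document.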

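set(document) creates a set of the words in the document, in which each word appears only once. This ensures that each word is counted only once per document, rather than 2 appearing 10 times in the list for 'Cheetos' when 'Cheetos' appears 10 times in document 2.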
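
To make the effect of set(document) concrete, here is a small sketch that builds the index with and without deduplication. The sample documents and the helper name create_index_with_duplicates are made up for illustration, and setdefault is used only as a shorthand for the if/else above.

```python
def create_inverted_index(documents):
    """Map each word to the list of document indexes that contain it."""
    index = {}
    for i, document in enumerate(documents):
        for word in set(document):  # each word is seen once per document
            index.setdefault(word, []).append(i)
    return index


def create_index_with_duplicates(documents):
    """Hypothetical variant without set(): repeated words repeat the index."""
    index = {}
    for i, document in enumerate(documents):
        for word in document:
            index.setdefault(word, []).append(i)
    return index


# Made-up sample data: 'cheetos' appears three times in document 1.
documents = [['owl', 'lion'], ['cheetos', 'cheetos', 'cheetos', 'lion']]

deduplicated = create_inverted_index(documents)
duplicated = create_index_with_duplicates(documents)

print(deduplicated['cheetos'])  # [1]
print(duplicated['cheetos'])    # [1, 1, 1]
print(deduplicated['lion'])     # [0, 1]
```

Because documents are visited in order and each word is appended at most once per document, every list of IDs comes out sorted without an explicit sort.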

Answer 2

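First, I would extract all possible words and store them in a set. Then, for each word, I look it up in every list and collect the indexes of all the lists that contain it.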

source = [
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'], 
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]

allWords = set(word for lst in source for word in lst)

wordDict = { word: [
                    i for i, lst in enumerate(source) if word in lst
                    ] for word in allWords }

print(wordDict)
Out[30]: 
{'a': [1, 2, 3],
 'afraid': [3],
 'always': [1, 4],
 'and': [2],
 ...
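
One possible refinement, not part of the original answer: `word in lst` scans a whole list each time, so the comprehension above gets slow when there are many words and long documents. A minimal sketch, assuming it is acceptable to convert each document to a set once up front (the shortened sample data and the names doc_sets, all_words and word_dict are made up here):

```python
# Shortened, made-up sample in the same shape as `source` above.
source = [
    ['why', 'was', 'cinderella', 'late'],
    ['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad'],
    ['what', 'did', 'the', 'crop', 'say'],
]

# Convert each document to a set once, so membership tests are cheap.
doc_sets = [set(lst) for lst in source]

# The vocabulary is the union of all the per-document sets.
all_words = set().union(*doc_sets)

word_dict = {
    word: [i for i, words in enumerate(doc_sets) if word in words]
    for word in all_words
}

print(word_dict['why'])  # [0, 1]
print(word_dict['the'])  # [1, 2]
```

The result has the same shape as wordDict; only the membership test changes.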

Answer 3

I would accumulate the indexes into a set to avoid duplicates, and then sort them:

>>> documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
>>> from collections import defaultdict
>>> D = defaultdict(set)
>>> for i, doc in enumerate(documents):
...     for word in doc:
...         D[word].add(i)
... 
>>> D ## Take a look at the defaultdict
defaultdict(<class 'set'>, {'owl': {0, 2}, 'leopard': {2}, 'lion': {0, 1}, 'deer': {1}})
>>> {k:sorted(v) for k,v in D.items()}
{'lion': [0, 1], 'owl': [0, 2], 'leopard': [2], 'deer': [1]}
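
As a usage sketch, the sorted index can be plugged into the indexed_search function from the question. The only change assumed here is index.get(query, []), so that a term missing from every document yields an empty result instead of a KeyError; the name postings is made up.

```python
from collections import defaultdict

documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

# Accumulate document IDs per word in sets, as in the session above.
postings = defaultdict(set)
for i, doc in enumerate(documents):
    for word in doc:
        postings[word].add(i)

# Freeze into a plain dict of sorted lists; a plain dict does not
# silently create entries when a missing key is looked up.
index = {word: sorted(ids) for word, ids in postings.items()}


def indexed_search(documents, index, query):
    """Return the documents that contain `query` (empty list if none)."""
    return [documents[doc_id] for doc_id in index.get(query, [])]


print(indexed_search(documents, index, 'lion'))   # [['owl', 'lion'], ['lion', 'deer']]
print(indexed_search(documents, index, 'zebra'))  # []
```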

Answer 4

As long as you don't need efficient code, this is simple:

documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]

def index(docs):
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index

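Now you get a two-level dictionary that gives access first to the document IDs and then to the positions of the term within each document: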

>>> index(documents)
{'lion': {1: [2], 2: [1]}, 'leopard': {3: [2]}, 'deer': {2: [2]}, 'owl': {1: [1], 3: [1]}}

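This is only a preliminary step for indexing; afterwards, you need to separate the term dictionary from the document postings and the positional postings. Typically, the dictionary is stored in a tree-like structure (there are Python packages for this), and the document postings and positional postings are represented as arrays of unsigned integers.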
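
A rough sketch of that separation, under several assumptions: a plain dict stands in for the tree-like term dictionary, the standard-library array module with type code 'I' provides the unsigned-integer arrays, and the function name split_postings and its return layout are made up for illustration.

```python
from array import array


def index(docs):
    """Two-level index from the answer: term -> {doc_id: [positions]}."""
    doc_index = {}
    for doc_id, doc in enumerate(docs, 1):
        for term_pos, term in enumerate(doc, 1):
            doc_index.setdefault(term, {}).setdefault(doc_id, []).append(term_pos)
    return doc_index


def split_postings(doc_index):
    """Hypothetical helper: split the index into three parallel structures."""
    term_ids = {}      # term -> slot number (stand-in for a tree structure)
    doc_postings = []  # slot -> array of document IDs
    pos_postings = []  # slot -> one array of positions per document ID
    for term in sorted(doc_index):
        term_ids[term] = len(doc_postings)
        by_doc = doc_index[term]
        doc_ids = sorted(by_doc)
        doc_postings.append(array('I', doc_ids))
        pos_postings.append([array('I', by_doc[d]) for d in doc_ids])
    return term_ids, doc_postings, pos_postings


documents = [['owl', 'lion'], ['lion', 'deer'], ['owl', 'leopard']]
term_ids, doc_postings, pos_postings = split_postings(index(documents))

lion = term_ids['lion']
print(list(doc_postings[lion]))               # [1, 2]
print([list(p) for p in pos_postings[lion]])  # [[2], [1]]
```

Document IDs and positions stay 1-based here, matching enumerate(..., 1) in the answer's code.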

Building a dictionary from a list of lists
