Question

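I have some documents to index, which means I need to read the documents, extract the words, and index them by storing which document each word appears in and at which position.

For each word, I initially create a separate file. Consider 2 files:

File 1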

The Problem of Programming Communication with

File 2

Programming of Arithmetic Operations

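So there will be 10 words, 8 of them unique. So I create 8 files:

the problem of programming communication with arithmetic operations

In each file, I store which document the word appears in and at which position. The actual structure I implemented holds more information, but this basic structure is enough to illustrate the point.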

File name       File content
the             1 1
problem         1 2
of              1 3 2 2
programming     1 4 2 1
communication   1 5
with            1 6
arithmetic      2 3
operations      2 4

Meaning: the word "of" appears in the 1st document at the 3rd position and in the 2nd document at the 2nd position.
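A minimal sketch of the per-word structure described above, kept in memory instead of in one file per word. The function name `build_postings` and the `(document number, position)` tuple layout are illustrative assumptions, not the code actually used in the question.

```python
from collections import defaultdict

def build_postings(documents):
    """Map each lowercased word to a list of (doc_number, position) pairs, both 1-based."""
    postings = defaultdict(list)
    for doc_number, text in enumerate(documents, start=1):
        # Split on whitespace so every token is a whole word.
        for position, word in enumerate(text.lower().split(), start=1):
            postings[word].append((doc_number, position))
    return dict(postings)

documents = [
    "The Problem of Programming Communication with",
    "Programming of Arithmetic Operations",
]
postings = build_postings(documents)
print(postings["of"])  # [(1, 3), (2, 2)]
```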

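Once the initial indexing is done, I concatenate all the files into a single index file, and in another file I store the offset at which each word's entry begins.

Index file: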

1 1 1 2 1 3 2 2 1 4 2 1 1 5 1 6 2 3 2 4

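Offset file:

the 1 problem 3 of 5 programming 9 communication 13 with 15 arithmetic 17 operations 19

So if I need the index information for "communication", I go to the 13th position of the file and read up to (but not including) the 15th position, in other words, up to the offset of the next word.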
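The lookup described above can be sketched as follows. This is only an in-memory illustration: the lists `index` and `offsets` stand in for the two files, and the helper name `lookup` is made up for the example. Offsets are 1-based, as in the question.

```python
# Posting lists in the order used in the question (hypothetical in-memory layout).
postings = [
    ("the", [1, 1]),
    ("problem", [1, 2]),
    ("of", [1, 3, 2, 2]),
    ("programming", [1, 4, 2, 1]),
    ("communication", [1, 5]),
    ("with", [1, 6]),
    ("arithmetic", [2, 3]),
    ("operations", [2, 4]),
]

index = []    # stands in for the concatenated index file
offsets = []  # (word, 1-based start position), stands in for the offset file
for word, numbers in postings:
    offsets.append((word, len(index) + 1))
    index.extend(numbers)

def lookup(word):
    """Return the numbers from this word's offset up to the next word's offset."""
    for i, (candidate, start) in enumerate(offsets):
        if candidate == word:
            # The last word has no successor, so it runs to the end of the index.
            end = offsets[i + 1][1] if i + 1 < len(offsets) else len(index) + 1
            return index[start - 1:end - 1]
    raise KeyError(word)

print(lookup("communication"))  # [1, 5]
```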

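This is all fine for a static index. But if I change a single index entry, I need to rewrite the whole file. Can I use a B-tree as the structure of the index file, so that I can change the file contents dynamically and update the offsets somehow? If so, can someone point me to a tutorial or library that shows how this works, or explain how to implement it?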
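To illustrate why a single change forces a rewrite in the flat layout, here is a small sketch that adds one posting and then has to shift every later offset. The helper `add_posting` and the third document are hypothetical and exist only for this example.

```python
index = [1, 1, 1, 2, 1, 3, 2, 2, 1, 4, 2, 1, 1, 5, 1, 6, 2, 3, 2, 4]
offsets = {"the": 1, "problem": 3, "of": 5, "programming": 9,
           "communication": 13, "with": 15, "arithmetic": 17, "operations": 19}

def add_posting(word, doc, position):
    """Append (doc, position) to the word's run and shift all later offsets by 2."""
    start = offsets[word]
    later = [o for o in offsets.values() if o > start]
    end = min(later) if later else len(index) + 1
    # Inserting in the middle moves everything after it, as it would on disk.
    index[end - 1:end - 1] = [doc, position]
    for other, offset in list(offsets.items()):
        if offset > start:
            offsets[other] = offset + 2

add_posting("problem", 3, 1)  # "problem" at position 1 of a made-up document 3
print(offsets["of"])          # 7
```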

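Thank you very much for taking the time to read such a long post.

Edit: I did not know the difference between a B-tree and a binary tree, so I originally asked this question using "binary tree". It is fixed now.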

Answer 1

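Basically you are trying to build an inverted index. Why is it necessary to use so many files? You can use a persistent object and a dictionary to do the work for you. Later, when the index changes, you just reload the persistent object, change the given entry, and save the object again.

Here is sample code that does this: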

import shelve

DOC1 = "The problem of Programming Communication with"
DOC2 = "Programming of Arithmetic Operations"

# Maps each lowercased word to a list of (document name, 1-based position) pairs.
inverted_index = {}

for name, text in (('DOC1', DOC1), ('DOC2', DOC2)):
    # Split into whole words first: a substring test such as `word in text`
    # would also match inside longer words and give wrong positions.
    for position, word in enumerate(text.lower().split(), start=1):
        inverted_index.setdefault(word, []).append((name, position))

# Saving to persistent object
with shelve.open('temp.db') as inverted_index_file:
    inverted_index_file['1'] = inverted_index

You can then look at the saved object like this (you can modify it using the same strategy):

>>> import shelve
>>> t = shelve.open('temp.db')['1']
>>> for word in sorted(t):
...     print(word, t[word])
...
arithmetic [('DOC2', 3)]
communication [('DOC1', 5)]
of [('DOC1', 3), ('DOC2', 2)]
operations [('DOC2', 4)]
problem [('DOC1', 2)]
programming [('DOC1', 4), ('DOC2', 1)]
the [('DOC1', 1)]
with [('DOC1', 6)]

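My point is that once you have built this, while the rest of your code is running, you can load the shelve object into memory as a dictionary and change it dynamically.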
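A minimal sketch of that reload, change, and save cycle. The temporary path and the extra `DOC3` posting are made up for the example; the key `'1'` follows the code above.

```python
import os
import shelve
import tempfile

# Hypothetical location; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "temp.db")

with shelve.open(path) as db:
    db["1"] = {"of": [("DOC1", 3), ("DOC2", 2)]}

# Later: reload the persistent object, change one entry, and save it again.
with shelve.open(path) as db:
    inverted_index = db["1"]  # a copy loaded into memory
    inverted_index.setdefault("of", []).append(("DOC3", 1))  # DOC3 is made up
    db["1"] = inverted_index  # reassign so the change is actually stored

with shelve.open(path) as db:
    result = db["1"]["of"]
print(result)  # [('DOC1', 3), ('DOC2', 2), ('DOC3', 1)]
```

Because the whole dictionary sits under one key here, every save re-pickles the entire index. Storing each word under its own shelf key is one way to rewrite only the entry that changed.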

If that does not work for you, then I would back using a database, especially sqlite3 because it is lightweight.
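As a rough sketch of the sqlite3 route, the postings could be stored one row per occurrence. The table name `postings`, its columns, and the helper `lookup` are assumptions chosen for this example, not something from the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file path instead for a persistent index
conn.execute("CREATE TABLE postings (word TEXT, doc INTEGER, position INTEGER)")
conn.execute("CREATE INDEX postings_word ON postings (word)")

documents = {
    1: "The Problem of Programming Communication with",
    2: "Programming of Arithmetic Operations",
}
rows = [(word, doc, position)
        for doc, text in documents.items()
        for position, word in enumerate(text.lower().split(), start=1)]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
conn.commit()

def lookup(word):
    """Return (doc, position) pairs for a word, ordered by document and position."""
    cursor = conn.execute(
        "SELECT doc, position FROM postings WHERE word = ? ORDER BY doc, position",
        (word,),
    )
    return cursor.fetchall()

print(lookup("of"))  # [(1, 3), (2, 2)]
```

SQLite keeps its tables and indexes in B-tree structures on disk, so adding or removing a posting is an ordinary INSERT or DELETE, with no offset file to maintain by hand.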

Answer 2

One option is to build the data with dicts and dump it to a file using cPickle (the pickle module in Python 3).
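A minimal sketch of that approach, written for Python 3, where the module is called `pickle`. The file name `index.pkl` and the sample dictionary are placeholders.

```python
import os
import pickle
import tempfile

inverted_index = {
    "of": [("DOC1", 3), ("DOC2", 2)],
    "programming": [("DOC1", 4), ("DOC2", 1)],
}

# Hypothetical file location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "index.pkl")

# Dump the whole dict to a file.
with open(path, "wb") as f:
    pickle.dump(inverted_index, f)

# Load it back later; any change means dumping the whole dict again.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == inverted_index)  # True
```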

Writing a B-tree to a file using Python
