Question

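I wrote a small 'search engine' that finds all the text files in a directory and its subdirectories. I can edit in the code, but I don't think it's necessary for my question.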

It works by creating a dictionary in this format:

term_frequency = {'file1' : { 'WORD1' : 1, 'WORD2' : 2, 'WORD3' : 3},
                  'file2' : { 'WORD1' : 1, 'WORD3' : 3, 'WORD4' : 4},
                  ...continues with all the files it has found...}

From the information it has gathered, it creates a second dictionary like this:

document_frequency = {'WORD1' : ['file1', 'file2'....],
                      'WORD2' : ['file1',............],
                      ....every found word..........}

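The purpose of the term_frequency dictionary is to hold the number of times each word is used in a given file, and document_frequency records which documents (and therefore how many) each word is used in.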
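To make the relationship between the two dictionaries concrete, here is a minimal sketch of how document_frequency could be derived from term_frequency. The function name build_document_frequency and the sample data are assumptions for illustration only, not the asker's actual code.

```python
# Hypothetical sample data in the shape described above.
term_frequency = {
    'file1': {'WORD1': 1, 'WORD2': 2, 'WORD3': 3},
    'file2': {'WORD1': 1, 'WORD3': 3, 'WORD4': 4},
}


def build_document_frequency(term_frequency):
    """Map each word to the sorted list of files that contain it."""
    document_frequency = {}
    for filename, counts in term_frequency.items():
        for word in counts:
            # Create the list on first sight of the word, then append.
            document_frequency.setdefault(word, []).append(filename)
    # Sort so the result does not depend on dictionary iteration order.
    for files in document_frequency.values():
        files.sort()
    return document_frequency


document_frequency = build_document_frequency(term_frequency)
```

The number of documents a word appears in is then simply len(document_frequency[word]).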

Then, when given a word, it calculates the relevance of every file as tf/df and lists the non-zero values in descending order of relevance.

For example:

file1 : 0.75
file2 : 0.5
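The ranking step described above might look roughly like the sketch below. It assumes that tf is the word's count in a file and df is the number of files containing the word; rank_files and the sample data are hypothetical names and values, so the scores will not match the example output shown.

```python
# Hypothetical sample data in the shape described in the question.
term_frequency = {
    'file1': {'WORD1': 3, 'WORD2': 2},
    'file2': {'WORD1': 1, 'WORD3': 4},
}
document_frequency = {
    'WORD1': ['file1', 'file2'],
    'WORD2': ['file1'],
    'WORD3': ['file2'],
}


def rank_files(word, term_frequency, document_frequency):
    """Return (filename, tf/df) pairs with non-zero scores, best first."""
    files = document_frequency.get(word, [])
    df = len(files)  # assumption: df = number of files containing the word
    scores = []
    for filename in files:
        tf = term_frequency[filename].get(word, 0)
        if tf:
            # float() keeps the division exact under Python 2 as well.
            scores.append((filename, float(tf) / df))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores


for filename, score in rank_files('WORD1', term_frequency, document_frequency):
    print('%s : %s' % (filename, score))
```

A word that appears in no file simply yields an empty list, so nothing is printed for it.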

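I'm aware that this is a very simplified representation of tf-idf, but I'm new to Python and programming (2 weeks) and am still getting familiar with it all.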

Sorry for the long introduction, but I feel it is relevant to the question... which brings me to it:

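How do I create an indexer that saves these dictionaries to a file, and then have the "searcher" read those dictionaries from the file? The problem right now is that every time you want to look up a different word, it has to read all the files again and build the same 2 dictionaries over and over.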

Answer 1

The pickle (and, for that matter, cPickle) library is your friend. By using pickle.dump(), you can write an entire dictionary to a file, which can later be read back with pickle.load().

In this case, you could use something like this:

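import pickle

# Write each dictionary to its own file; pickle needs binary mode ('wb').
# The with statement closes the file even if dump() raises.
with open('terms.pkl', 'wb') as termfile:
    pickle.dump(term_frequency, termfile)
with open('documents.pkl', 'wb') as documentfile:
    pickle.dump(document_frequency, documentfile)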

And to read them back:

# Read the dictionaries back; open in binary mode ('rb') to match the dump.
with open('terms.pkl', 'rb') as termfile:
    term_frequency = pickle.load(termfile)
with open('documents.pkl', 'rb') as documentfile:
    document_frequency = pickle.load(documentfile)
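Putting the two halves together, one possible way to split the work into an "indexer" and a "searcher" is to build the dictionaries only when no saved index exists yet. This is a sketch under assumptions: save_index, load_index, get_index and build_sample are hypothetical names, both dictionaries are stored in a single file as a tuple, and the build function stands in for the asker's real directory scan.

```python
import os
import pickle
import tempfile


def save_index(path, term_frequency, document_frequency):
    """Indexer side: store both dictionaries in one pickle file."""
    with open(path, 'wb') as f:
        pickle.dump((term_frequency, document_frequency), f,
                    pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Searcher side: return (term_frequency, document_frequency)."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def get_index(path, build):
    """Load the saved index if present; otherwise build and save it."""
    if os.path.exists(path):
        return load_index(path)
    term_frequency, document_frequency = build()
    save_index(path, term_frequency, document_frequency)
    return term_frequency, document_frequency


# Hypothetical stand-in for the expensive directory scan.
calls = []


def build_sample():
    calls.append(1)  # record how many times the scan actually runs
    return ({'file1': {'WORD1': 1}}, {'WORD1': ['file1']})


index_path = os.path.join(tempfile.mkdtemp(), 'index.pkl')
first = get_index(index_path, build_sample)   # builds and saves
second = get_index(index_path, build_sample)  # loads from disk
```

With this layout the scan runs once, and later searches only pay the cost of pickle.load(). The saved file has to be deleted or rebuilt whenever the text files change, since nothing here detects stale data.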

Python 2.7: persistent search and indexing
