Question

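This is a follow-up to another question, but I think it is better asked as a separate question. Given a large list of sentences (on the order of 100,000):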

[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]

What is the best way to write the following function?

def GetSentences(word1, word2, position):
    return ""

Given two words, word1 and word2, and a position, the function should return a list of all sentences that satisfy that constraint. For example:

GetSentences("sentence", "another", 3)

should return sentences 1 and 3, as indices into the list of sentences. My current approach uses a dictionary like this:

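from collections import defaultdict

# Index[word1][word2][distance] -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # Start at index + 1 so that i + 1 is the distance between the two words
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)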

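However, this quickly gets out of hand: on a dataset of about 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I somehow feel this is a common problem, but I cannot find any references on how to solve it efficiently. Any suggestions on how to approach this?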

Answer 1

Use a database to store the values.

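First, add all sentences to one table (they should have IDs). You could call it, e.g., sentences.
Second, create a table of the words contained in all the sentences (call it words and give each word an ID), and store the connections between sentence records and word records in a separate table (call it, e.g., sentences_words; it should have two columns, preferably word_id and sentence_id).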
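A minimal sketch of that schema, using Python's built-in sqlite3 module. The choice of SQLite, the in-memory database, the index name, and the helper name build_index are assumptions made for illustration; the table and column names follow the description above.

```python
import sqlite3

def build_index(sentences):
    """Load sentences into three tables: sentences, words, sentences_words."""
    # Assumption: an in-memory database; pass a file path for real data.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE sentences_words (word_id INTEGER, sentence_id INTEGER);
        CREATE INDEX idx_sw_word ON sentences_words (word_id);
    """)
    for sentence_id, sentence in enumerate(sentences):
        conn.execute("INSERT INTO sentences (id, sentence) VALUES (?, ?)",
                     (sentence_id, sentence))
        # One link row per distinct word in the sentence.
        for word in set(sentence.split()):
            # Register the word if it is new, then look up its id.
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            (word_id,) = conn.execute(
                "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
            conn.execute(
                "INSERT INTO sentences_words (word_id, sentence_id) VALUES (?, ?)",
                (word_id, sentence_id))
    conn.commit()
    return conn
```

For a large dataset it would probably be worth batching the inserts (for example with executemany) rather than issuing one statement per word.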
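When searching for sentences that contain all of the given words, your job is then simplified:
1. You should first find the records from the words table where the word is exactly one of the words you are searching for. The query could look like this: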
```
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
```
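2. Second, you should find the sentence_id values from the sentences_words table that have the required word_id values (corresponding to the words from the words table). The initial query could look like this:
```
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN (
    SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
);
```
  which can be simplified to:
```
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN ([here goes list of words' ids]);
```
3. Filter the result in Python to return only the sentence_id values that have all the required word_id values.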
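The filtering in step 3 could be sketched roughly as follows. The function name sentences_with_all_words and the sample rows are hypothetical; the rows stand in for the (sentence_id, word_id) pairs returned by the query in step 2.

```python
from collections import defaultdict

def sentences_with_all_words(rows, required_word_ids):
    """Keep only sentence ids that are linked to every required word id.

    rows: iterable of (sentence_id, word_id) pairs from the step 2 query.
    """
    required = set(required_word_ids)
    seen = defaultdict(set)
    for sentence_id, word_id in rows:
        seen[sentence_id].add(word_id)
    # A sentence qualifies when its word ids are a superset of the required ones.
    return sorted(sid for sid, ids in seen.items() if required <= ids)

# Made-up sample data: sentence 1 lacks word id 11.
rows = [(0, 10), (0, 11), (1, 10), (2, 11), (2, 10)]
print(sentences_with_all_words(rows, [10, 11]))  # [0, 2]
```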

This is basically a solution based on storing a large amount of data in the form best suited to it: a database.

EDIT

If you only ever search for two words, you can do more (almost everything) on the DBMS side.

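Since you also need the position difference, you should store the position of each word in a third column of the sentences_words table (let's call it position), and when searching for the matching words, compute the difference between the values of this column for the two words.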
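One way this could look, again sketched with Python's sqlite3 module. The column name position, the index name, the function name get_sentences, and the self-join are assumptions for illustration rather than the only way to express the query.

```python
import sqlite3

def get_sentences(conn, word1, word2, position):
    """Return ids of sentences where word2 occurs `position` words after word1."""
    query = """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b
          ON b.sentence_id = a.sentence_id
         AND b.position = a.position + ?
        WHERE a.word_id = (SELECT id FROM words WHERE word = ?)
          AND b.word_id = (SELECT id FROM words WHERE word = ?)
        ORDER BY a.sentence_id
    """
    return [row[0] for row in conn.execute(query, (position, word1, word2))]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE sentences_words (word_id INTEGER, sentence_id INTEGER, position INTEGER);
    CREATE INDEX idx_sw ON sentences_words (word_id, sentence_id, position);
""")

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
for sentence_id, sentence in enumerate(sentences):
    for pos, word in enumerate(sentence.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        # One row per word occurrence, so storage grows linearly with the text.
        conn.execute(
            "INSERT INTO sentences_words SELECT id, ?, ? FROM words WHERE word = ?",
            (sentence_id, pos, word))

print(get_sentences(conn, "sentence", "another", 3))  # [1, 3]
```

Unlike the nested dictionary in the question, this stores one row per word occurrence instead of one entry per word pair, and leaves the pairing to the join at query time.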

Answer 2

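Here is how I did it in Python. If this needs to be done more than once, though, a DBMS is the right tool for the job. Still, this seems to work pretty well for me with a million rows.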

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4"
    ]

sentences = sentences * 200 * 1000

sentencesProcessed = []

def preprocess():
    global sentences
    global sentencesProcessed
    # may want to do a regex split on whitespace
    sentencesProcessed = [sentence.split(" ") for sentence in sentences]

    # can deallocate sentences now
    sentences = None


def GetSentences(word1, word2, position):
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print("Got", len(results), "results")

if __name__ == "__main__":
    main()

Most efficient way to index words in a document?
