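This follows on from another question, but I think it is better asked as a separate question. Given a large list of sentences (on the order of 100,000):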
[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]
What is the best way to write the following function?
def GetSentences(word1, word2, position):
    return ""
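Given two words, word1 and word2, and a position, position (the number of words by which word2 follows word1), the function should return the list of all sentences that satisfy the constraint. For example: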
GetSentences("sentence", "another", 3)
should return sentences 1 and 3, given as the indices of the sentences. My current approach uses a dictionary like this:
from collections import defaultdict

Index = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [])))
for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # word2 sits i + 1 words after word
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)
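However, this quickly blows up on a dataset of about 130 MB: my 48 GB of RAM is exhausted in less than 5 minutes. I somehow feel this is a common problem, but I could not find any references on how to solve it efficiently. Any suggestions on how to approach this?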
Answer 0 (score: 14)
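Use a database to store the values.
First, store all the sentences in one table (called sentences, for example).
Second, create a table of the words contained in all the sentences (called words, for example, giving each word an ID), and save the connections between the records of the sentences table and the records of the words table in a separate table (called sentences_words, for example; it should have two columns, preferably word_id and sentence_id).
Searching for the sentences that contain all of the given words then becomes simpler:
First, find the records in the words table whose word is exactly one of the words you are searching for. The query might look like this: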
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
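Second, find the sentence_id values in the sentences_words table that have the required word_id values (corresponding to the words in the words table). The initial query might look like this:
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN ([here goes list of words' ids]);
which can be simplified to:
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN (
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
);
Finally, filter the result in Python to return only the sentence_id values that have all of the word_id IDs you need.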
This is essentially a solution based on storing a large amount of data in the form best suited to it: a database.
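The steps above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration under stated assumptions, not a tuned implementation: the in-memory database, the text column, the index name and the helper names sentences_with_all_words and get_sentences are all hypothetical, and the position column is an assumption added here so that the offset constraint from the question can be checked in SQL as well.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    -- position is an assumed extra column: the word's index in its sentence
    CREATE TABLE sentences_words (sentence_id INTEGER, word_id INTEGER, position INTEGER);
    CREATE INDEX sentences_words_by_word ON sentences_words (word_id);
""")

for sentence_id, text in enumerate(sentences):
    conn.execute("INSERT INTO sentences (id, text) VALUES (?, ?)", (sentence_id, text))
    for position, word in enumerate(text.split()):
        # Each distinct word is stored once; repeats are ignored via UNIQUE.
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = conn.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()[0]
        conn.execute(
            "INSERT INTO sentences_words (sentence_id, word_id, position) VALUES (?, ?, ?)",
            (sentence_id, word_id, position),
        )
conn.commit()


def sentences_with_all_words(wanted):
    """Fetch the rows for the wanted words, then filter in Python."""
    wanted = sorted(set(wanted))
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words "
        "WHERE word_id IN (SELECT id FROM words WHERE word IN (" + placeholders + "))",
        wanted,
    )
    found = {}
    for sentence_id, word_id in rows:
        found.setdefault(sentence_id, set()).add(word_id)
    # Keep a sentence only if it matched every distinct wanted word.
    return sorted(s for s, ids in found.items() if len(ids) == len(wanted))


def get_sentences(word1, word2, position):
    """Self-join: word2 must sit `position` words after word1 in the same sentence."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b
          ON b.sentence_id = a.sentence_id AND b.position = a.position + ?
        WHERE a.word_id = (SELECT id FROM words WHERE word = ?)
          AND b.word_id = (SELECT id FROM words WHERE word = ?)
        ORDER BY a.sentence_id
        """,
        (position, word1, word2),
    )
    return [row[0] for row in rows]


print(sentences_with_all_words(["sentence", "another"]))  # [1, 3]
print(get_sentences("sentence", "another", 3))            # [1, 3]
```

For a dataset of the size mentioned in the question, a file-backed database and batched inserts would probably be preferable to the row-by-row loop used here.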
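Edit
To take the position difference into account as well, store each word's position within its sentence in a third column of the sentences_words table, and when searching for the matching words, compute the difference between the values of this column for the two words.
Answer 1 (score: 2)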
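Here is how I did it in Python. That said, assuming this needs to be done more than once, a DBMS is the right tool for the job. Still, this seems to work pretty well for me with a million rows.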
sentences = [
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]
sentences = sentences * 200 * 1000
sentencesProcessed = []

def preprocess():
    global sentences
    global sentencesProcessed
    # may want to do a regex split on whitespace
    sentencesProcessed = [sentence.split(" ") for sentence in sentences]
    # can deallocate sentences now
    sentences = None

def GetSentences(word1, word2, position):
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        # stop early enough that wordIndex + position stays inside the sentence
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print("Got", len(results), "results")

if __name__ == "__main__":
    main()
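If the same list has to be queried many times, rescanning every sentence on each call may become the bottleneck. One possible variation, which is not part of the answer above, is to record each word's positions once and look them up afterwards. The sketch below assumes the index fits in memory; the names positions and get_sentences_indexed are hypothetical. It stores one entry per word occurrence, whereas the nested dictionary in the question stores one entry per pair of words in a sentence.

```python
from collections import defaultdict

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

# word -> set of (sentence index, word position) pairs, one per occurrence
positions = defaultdict(set)
for sentence_index, sentence in enumerate(sentences):
    for word_index, word in enumerate(sentence.split()):
        positions[word].add((sentence_index, word_index))


def get_sentences_indexed(word1, word2, position):
    """Return sorted indices of sentences where word2 follows word1 by `position` words."""
    second = positions.get(word2, set())
    hits = {
        sentence_index
        for sentence_index, word_index in positions.get(word1, set())
        if (sentence_index, word_index + position) in second
    }
    return sorted(hits)


print(get_sentences_indexed("sentence", "another", 3))  # [1, 3]
```

Each call only touches the occurrences of word1, so its cost depends on how common that word is rather than on the total number of sentences.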