Question

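I'm trying to write a function named and_query that takes a single string of one or more words as input and returns a list of the documents whose abstracts contain those words.

First, I put all the words into an inverted index, where id is the document's ID and the abstract is the plain text.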

inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)
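
For anyone who wants to run this, here is a minimal sketch of the index construction on toy data. The question does not show `Abstracts`, `tokenize`, or `preprocess`, so the versions below are hypothetical stand-ins (whitespace splitting and lowercasing), not the asker's actual helpers.

```python
from collections import defaultdict

# Hypothetical stand-ins: the question does not show these helpers.
def tokenize(text):
    return text.split()

def preprocess(tokens):
    return [t.lower() for t in tokens]

# Hypothetical toy data: document id -> abstract text.
Abstracts = {
    1: "Netherlands vaccine trial results",
    2: "Vaccine trial in Belgium",
    3: "Netherlands tourism report",
}

# term -> set of ids of the documents whose abstract contains that term
inverted_index = defaultdict(set)
for doc_id, abstract in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["vaccine"]))
print(sorted(inverted_index["netherlands"]))
```

Each entry maps a term to the set of document ids it appears in, which is what makes set operations such as intersection applicable later.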

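Then I wrote a query function, where finals is a list of all matching documents.

Because it should only return documents that match every word of the function's argument, I used the set operation "intersection".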

def and_query(tokens):
    documents=set()
    finals = []
    terms = preprocess(tokenize(tokens))

    for term in terms:
        for i in inverted_index[term]:
            documents.add(i)

    for term in terms:
        temporary_set= set()
        for i in inverted_index[term]:
            temporary_set.add(i)
        finals.extend(documents.intersection(temporary_set))
    return finals

def finals_print(finals):
    for final in finals:
        display_summary(final)        

finals_print(and_query("netherlands vaccine trial"))

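However, the function still seems to return documents whose abstract contains only one of the words.

Does anyone know what I'm doing wrong with the set operations?

(I think the error is somewhere in this part of the code):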

for term in terms:
    temporary_set= set()
    for i in inverted_index[term]:
        temporary_set.add(i)
    finals.extend(documents.intersection(temporary_set))
return finals

Thanks in advance.

In short, what I want to do:

for word in words:
    id_set_for_one_word = set()
    for i in get_id_of_that_word[word]:
        id_set_for_one_word.add(i)
    # pseudo:
    # id_set_for_one_word intersection (id_set_of_other_words)

finals.extend(set of all intersections for all words)

So I need the intersection of the id sets of all these words, which gives the set of ids that occur for every one of the words.
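
The intent described above, one id set per word and then the intersection of all of them, can be sketched as follows. The `index` dictionary and the function name `ids_matching_all` are hypothetical, chosen only for illustration.

```python
# Hypothetical toy index: term -> set of document ids.
index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2},
    "trial": {1, 2},
}

def ids_matching_all(words, index):
    # One id set per word; a word missing from the index contributes an empty set.
    id_sets = [index.get(word, set()) for word in words]
    if not id_sets:
        return set()
    # set.intersection accepts any number of sets and keeps only the common ids.
    return set.intersection(*id_sets)

print(ids_matching_all(["netherlands", "vaccine", "trial"], index))
```

With this toy data only document 1 contains all three words, so it is the only id left after the intersection.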

Answer 1

To elaborate, here is a rough draft of how I have approached this kind of problem.

def tokenize(abstract):
    # return <set of words in abstract>
    set_ = ...  # placeholder: build the set of words here
    return set_

# one (id, abstract, tokens) tuple per document
candidates = [(id, abstract, tokenize(abstract)) for (id, abstract) in Abstracts.items()]


all_criterias = "netherlands vaccine trial".split()


def searcher(candidates, criteria, match_on_found=True):

    search_results = []
    for cand in candidates:
        # cand[2] holds the set of tokens from the abstract
        found = criteria in cand[2]
        # match_on_found=True keeps matches (AND);
        # match_on_found=False keeps non-matches (AND NOT)
        if found == match_on_found:
            search_results.append(cand)
    return search_results


for criteria in all_criterias:
    # pass in the current list every time; it shrinks progressively
    candidates = searcher(candidates, criteria)

# what's left is what you want
answer = [(cand[0], cand[1]) for cand in candidates]
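
A runnable sketch of the same progressive-filtering idea on toy data. The contents of `Abstracts` and the whitespace-based `tokenize` are assumptions made for illustration, and `match_on_found=False` is assumed to mean AND NOT, as the comment in the draft suggests.

```python
# Hypothetical toy data: document id -> abstract text.
Abstracts = {
    1: "Netherlands vaccine trial results",
    2: "Vaccine trial in Belgium",
    3: "Netherlands tourism report",
}

def tokenize(abstract):
    # Hypothetical tokenizer: lowercase, split on whitespace, keep unique words.
    return set(abstract.lower().split())

def searcher(candidates, criteria, match_on_found=True):
    # Keep candidates whose token set contains the word (AND),
    # or, with match_on_found=False, those that do not (AND NOT).
    return [cand for cand in candidates if (criteria in cand[2]) == match_on_found]

all_candidates = [(doc_id, text, tokenize(text)) for doc_id, text in Abstracts.items()]

candidates = all_candidates
for criteria in "netherlands vaccine trial".split():
    # Each pass keeps only the survivors of the previous pass.
    candidates = searcher(candidates, criteria)

answer = [(cand[0], cand[1]) for cand in candidates]
print(answer)
```

Because every pass filters the output of the previous one, a document survives only if it contains all of the words.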

Answer 2

Question: return a list of documents matching the words in the document abstracts

The term with the minimum number of documents always bounds the result.
If a term does not exist in inverted_index, there is no match at all.
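
A minimal sketch of the two observations above, assuming a plain dictionary as the index. The function name `and_query_ids` and the toy data are hypothetical. Starting from the smallest id set keeps the intermediate result as small as possible, and a term missing from the index ends the search immediately.

```python
# Hypothetical toy index: term -> set of document ids.
toy_index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2},
    "trial": {1, 2, 4},
}

def and_query_ids(terms, inverted_index):
    postings = []
    for term in terms:
        # A term that is not indexed means no document can match all terms.
        if term not in inverted_index:
            return set()
        postings.append(inverted_index[term])
    if not postings:
        return set()
    # The smallest id set is an upper bound on the result, so start there.
    postings.sort(key=len)
    result = set(postings[0])  # copy, so the index itself is not modified
    for ids in postings[1:]:
        result &= ids
        if not result:
            break  # already empty, further intersections cannot add ids
    return result

print(and_query_ids(["netherlands", "vaccine", "trial"], toy_index))
```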

For simplicity, with predefined data:

term

Output:
inverted_index

Tested with Python: 3.4.2

Answer 3

In the end I found the solution myself. Replacing

    finals.extend(documents.intersection(id_set_for_one_word))
return finals

with

    documents = (documents.intersection(id_set_for_one_word))
return documents

seems to work here.
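
To see why this change matters, here is a small side-by-side sketch on a hypothetical toy index, with `tokens.lower().split()` standing in for `preprocess(tokenize(tokens))`. In the original version `documents` stays the union of all id sets, so each intersection just returns the ids of the current word, and `finals` ends up holding every id of every word. Reassigning `documents` narrows it on each pass instead.

```python
# Hypothetical toy index: term -> set of document ids.
inverted_index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2},
    "trial": {1, 2},
}

def buggy_and_query(tokens):
    terms = tokens.lower().split()  # stand-in for preprocess(tokenize(tokens))
    documents = set()
    finals = []
    for term in terms:
        documents |= inverted_index.get(term, set())
    for term in terms:
        id_set_for_one_word = inverted_index.get(term, set())
        # documents is still the full union here, so this appends
        # every id of the current word, matching or not.
        finals.extend(documents.intersection(id_set_for_one_word))
    return finals

def and_query(tokens):
    terms = tokens.lower().split()  # stand-in for preprocess(tokenize(tokens))
    documents = set()
    for term in terms:
        documents |= inverted_index.get(term, set())
    for term in terms:
        id_set_for_one_word = inverted_index.get(term, set())
        # Reassigning shrinks documents to the ids shared by all words so far.
        documents = documents.intersection(id_set_for_one_word)
    return documents

print(sorted(buggy_and_query("netherlands vaccine trial")))  # [1, 1, 1, 2, 2, 3]
print(and_query("netherlands vaccine trial"))  # {1}
```

Note that the fixed version returns a set rather than a list, so duplicates disappear as well.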

Still, thank you all for your efforts.

Writing an AND query to find matching documents in a dataset (Python)
