Building an inverted list of sentences and their respective positions in documents

Time: 2017-03-26 22:04:37

Tags: python inverted-index

I'm trying to use Python to build an inverted list of sentences and their positions in the source documents, and I'm failing.

Let's say I have two documents:

Document 1

I like bananas. I don't like pears.

Document 2

I don't like heights. I like bananas.

I'm trying to build an index of the sentences in these documents, like this:

Sent                   File[pos]    
I like bananas         Doc1[1], Doc2[2]
I don't like pears     Doc1[2]
I don't like heights   Doc2[1]
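
Put differently, the target structure maps each sentence to every document and sentence position it occurs at; as a sketch, here is the table above written out as a Python dict:

index = {
    "I like bananas": {"doc1": [1], "doc2": [2]},
    "I don't like pears": {"doc1": [2]},
    "I don't like heights": {"doc2": [1]},
}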

I've found countless examples of building inverted lists of words and the files they occur in, but I can't find anything about building a sentence index.

I've tried adapting a piece of code on Github that handles a traditional word index, but I'm clearly missing something, because my hack doesn't work.

The code I'm using is shown below. The main difference between my code and the code mentioned above is that I use NLTK to tokenize the documents into sentences.
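
For reference, this is what the NLTK sentence tokenizer produces for Document 1 (a minimal sketch; it assumes the punkt model has already been downloaded with nltk.download('punkt')):

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize("I like bananas. I don't like pears."))
# ['I like bananas.', "I don't like pears."]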

My code is as follows:

import nltk.data
import codecs
import os
import unicodedata
from functools import reduce  # reduce() is not a builtin in Python 3; search() needs it


def sentence_split(text):
    sent_list = []
    scurrent = []
    sindex = None
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    for i, c in enumerate(text):
        if c.isalnum():
            # Collect alphanumeric characters; spaces and punctuation
            # are silently dropped.
            scurrent.append(c)
            sindex = i

        elif scurrent:
            # scurrent is never cleared here, so the accumulated text keeps
            # growing and a new, longer entry is appended at every word
            # boundary.
            scurrent_str = ''.join(map(str, scurrent))
            sentence_prep = ''.join(tokenizer.tokenize(scurrent_str))

            sentence = ''.join(sentence_prep)
            sent_list.append((sindex - len(sentence) + 1, sentence))

    if scurrent:
        # Flush whatever is still buffered when the text ends.
        scurrent_str = ''.join(map(str, scurrent))
        sentence_prep = ''.join(tokenizer.tokenize(scurrent_str))
        sentence = ''.join(sentence_prep)
        sent_list.append((sindex - len(sentence) + 1, sentence))

    return sent_list

def sentence_normalize(sentences):
    normalized_sentences = []
    for index, sentence in sentences:
        snormalized = sentence.lower()
        normalized_sentences.append((index, snormalized))
    return normalized_sentences

def sentence_index(text):
    sentences = sentence_split(text)
    sentences = sentence_normalize(sentences)
    return sentences

def inverted_index(text):
    inverted = {}

    for index, sentence in sentence_index(text):
        locations = inverted.setdefault(sentence, [])
        locations.append(index)

    return inverted

def inverted_index_add(inverted, doc_id, doc_index):

    for sentence, locations in doc_index.items():
        indices = inverted.setdefault(sentence, {})
        indices[doc_id] = locations
    return inverted

def search(inverted, query):

    # Intersect the sets of documents that contain every sentence of the query.
    sentences = [sentence for _, sentence in sentence_index(query) if sentence in inverted]
    results = [set(inverted[sentence].keys()) for sentence in sentences]
    return reduce(lambda x, y: x & y, results) if results else []

if __name__ == '__main__':
    doc1 = """
Niners head coach Mike Singletary will let Alex Smith remain his starting 
quarterback, but his vote of confidence is anything but a long-term mandate.
Smith now will work on a week-to-week basis, because Singletary has voided 
his year-long lease on the job.
"I think from this point on, you have to do what's best for the football team,"
Singletary said Monday, one day after threatening to bench Smith during a 
27-24 loss to the visiting Eagles.
"""

    doc2 = """
The fifth edition of West Coast Green, a conference focusing on "green" home 
innovations and products, rolled into San Francisco's Fort Mason last week 
intent, per usual, on making our living spaces more environmentally friendly 
- one used-tire house at a time.
To that end, there were presentations on topics such as water efficiency and 
the burgeoning future of Net Zero-rated buildings that consume no energy and 
produce no carbon emissions.
"""

    inverted = {}
    documents = {'doc1':doc1, 'doc2':doc2}
    for doc_id, text in documents.items():
        doc_index = inverted_index(text)
        inverted_index_add(inverted, doc_id, doc_index)

    for sentence, doc_locations in inverted.items():
        print (sentence, doc_locations)

    queries = ['I think from this point on, you have to do whats best for the football team,"Singletary said Monday, one day after threatening to bench Smith during a 27-24 loss to the visiting Eagles']
    for query in queries:
        result_docs = search(inverted, query)
        print("Search for '%s': %r" % (query, result_docs))
        for _, sentence in sentence_index(query):
            def extract_text(doc, index):
                return documents[doc][index:index+20].replace('\n', ' ')

            for doc in result_docs:
                for index in inverted[sentence][doc]:
                    print ('   - %s...' % extract_text(doc, index))

            print()

Here is a snippet of the output:

niners {'doc1': [1]}
ninershead {'doc1': [2]}
ninersheadcoach {'doc1': [3]}
ninersheadcoachmike {'doc1': [4]}

1 Answer:

Answer 0 (score: 0):

How about this?

In [164]: import nltk.data

In [165]: from collections import defaultdict

In [166]: tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [167]: txt1
Out[167]: "I like bananas. I don't like pears."

In [168]: txt2
Out[168]: "I don't like heights. I like bananas."

In [169]: doc1 = tokenizer.tokenize(txt1)
In [170]: doc2 = tokenizer.tokenize(txt2)

In [171]: sent_doc = [(sent, "doc" + str(idx+1) + str([idxx+1]))  for idx, it in enumerate([doc1, doc2]) for idxx, sent in enumerate(it)]

In [172]: sent_doc
Out[172]: 
[('I like bananas.', 'doc1[1]'),
 ("I don't like pears.", 'doc1[2]'),
 ("I don't like heights.", 'doc2[1]'),
 ('I like bananas.', 'doc2[2]')]

Now, build the dictionary.

In [176]: dict_ = defaultdict()

In [177]: for sent, doc in sent_doc:
     ...:     if sent in dict_:
     ...:         dict_[sent] = dict_[sent] + [doc]
     ...:     else:
     ...:         dict_[sent] = [doc]

# output
In [178]: dict_
Out[178]: 
defaultdict(None,
            {"I don't like heights.": ['doc2[1]'],
             "I don't like pears.": ['doc1[2]'],
             'I like bananas.': ['doc1[1]', 'doc2[2]']})
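
For completeness, the same idea as a self-contained script; build_sentence_index and the variable names are mine, and using defaultdict(list) makes the explicit membership test unnecessary:

from collections import defaultdict

import nltk.data

# Punkt sentence tokenizer (run nltk.download('punkt') once beforehand).
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def build_sentence_index(documents):
    # Map each sentence to (doc_id, position) pairs; positions start at 1.
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, sent in enumerate(tokenizer.tokenize(text), start=1):
            index[sent].append((doc_id, pos))
    return index

documents = {
    'doc1': "I like bananas. I don't like pears.",
    'doc2': "I don't like heights. I like bananas.",
}

for sent, locations in build_sentence_index(documents).items():
    print(sent, locations)

# I like bananas. [('doc1', 1), ('doc2', 2)]
# I don't like pears. [('doc1', 2)]
# I don't like heights. [('doc2', 1)]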