"Diffing" two files in Python?

Date: 2016-02-25 14:45:42

Tags: python nlp nltk

Maybe what I mean isn't so much "diffing" as "approximate string matching" or "fuzzy text matching"? Basically I want a program that works like grep, but that searches for phrases of length X that appear in roughly the same form in both texts, and returns those phrases along with their original context. This is what I have so far. (It's also in this ipython notebook, with sample output.)
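For a quick feel of what "fuzzy text matching" means with nothing but the standard library, difflib.get_close_matches ranks candidate strings by similarity; this is only an illustration of the general idea, not part of the program below:

```python
from difflib import get_close_matches

# Rank candidates by their SequenceMatcher similarity to the query;
# a misspelling still finds its closest neighbours.
words = ['ape', 'apple', 'peach', 'puppy']
print(get_close_matches('appel', words))  # -> ['apple', 'ape']
```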

import nltk
import re
from nltk.util import ngrams
from difflib import SequenceMatcher
from string import punctuation
from termcolor import colored
from fuzzysearch import find_near_matches

class Matcher: 
    def __init__(self, fileA, fileB, threshold, ngramSize):
        """
        Gets the texts from the files, tokenizes them, 
        cleans them up as necessary. 
        """
        self.threshold = threshold

        self.filenameA = fileA
        self.filenameB = fileB

        self.textA = self.readFile(fileA)
        self.textB = self.readFile(fileB)

        textATokens = self.tokenize(self.textA)
        textBTokens = self.tokenize(self.textB)

        self.textAgrams = list(ngrams(textATokens, ngramSize))
        self.textBgrams = list(ngrams(textBTokens, ngramSize))

    def readFile(self, filename): 
        """ Reads the file into memory. """
        with open(filename) as f:
            return f.read()

    def tokenize(self, text): 
        """ Tokenizes the text, breaking it up into words. """
        return nltk.word_tokenize(text.lower())

    def gramsToString(self, grams): 
        """
        Takes a list of tuples (3-grams, 4-grams, etc.) 
        and stitches it back together into a string, so that
        we can search the non-tokenized text for the string later. 
        """
        string = " ".join(grams[0][:-1])
        for gram in grams:
            lastGram = gram[-1]
            if lastGram not in punctuation: 
                string += " " + lastGram
            else: 
                string += lastGram
        return string

    def getMatch(self, match, textA, textB): 
        """ 
        Takes the match object returned by get_matching_blocks() and
        gets the matched n-gram. It uses gramsToString() to 
        reformat this into a string.
        """
        textAs, textBs = [], []
        for i in range(match.size):
            textAs.append(textA[match.a+i])
            textBs.append(textB[match.b+i])
        return (self.gramsToString(textAs), self.gramsToString(textBs))

    def match(self): 
        """
        This does the main work of finding matching n-gram sequences between
        the texts.
        """
        sequence = SequenceMatcher(None, self.textAgrams, self.textBgrams)
        matchingBlocks = sequence.get_matching_blocks()

        # Only return the matching sequences that are higher than the 
        # threshold given by the user. 
        highMatchingBlocks = [match for match in matchingBlocks if match.size > self.threshold]

        for match in highMatchingBlocks: 
            out = self.getMatch(match, self.textAgrams, self.textBgrams)
            print('\n', out)
            self.findInText(out[0], self.textA, self.filenameA, 20)
            self.findInText(out[1], self.textB, self.filenameB, 20)

    def findInText(self, needle, haystack, haystackName, context):
        """
        This takes the matches found by match() and tries to find that match
        again in the text, so that we can return some context. Uses the
        fuzzysearch library, because I couldn't find anything better.
        """
        m = find_near_matches(needle, haystack, max_l_dist=2)

        if len(m) > 0: 
            m = m[0] # just get the first match for now. TODO: get all of them

            before = haystack[m.start-context:m.start]
            match  = colored(haystack[m.start:m.end], 'red')
            after  = haystack[m.end:m.end+context]    

            contextualized = before + match + after
            cleaned = re.sub(r'\s+', ' ', contextualized).strip()
            print(colored(haystackName, 'green') + ": " + cleaned)
        else: 
            print('Couldn\'t find this match in file: ', haystackName)

Usage:

myMatch = Matcher('milton.txt', 'kjv.txt', 2, 3)
myMatch.match()
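To see what match() relies on, here is a minimal standalone sketch of SequenceMatcher.get_matching_blocks(): it returns (a, b, size) triples describing runs that appear in both sequences (the two token lists below are invented stand-ins for textAgrams and textBgrams):

```python
from difflib import SequenceMatcher

# Invented stand-ins for the two tokenized texts.
a = ['in', 'the', 'beginning', 'god', 'created']
b = ['said', 'in', 'the', 'beginning', 'was']

sm = SequenceMatcher(None, a, b)
# The final block is always a zero-length sentinel, so filter it out.
blocks = [m for m in sm.get_matching_blocks() if m.size > 0]
for m in blocks:
    print(a[m.a:m.a + m.size])  # -> ['in', 'the', 'beginning']
```

The class above does exactly this, except the sequence elements are n-gram tuples rather than single tokens, which is what makes recovering the original file positions awkward.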

It mostly works, but it's awkward, because it has to 1) find a matching string using the n-grams, and then 2) figure out where in the texts those n-grams came from. Sometimes the program finds a match but then can't locate where in the original text that match occurred. The whole thing could be simplified if there were a way to search the files themselves directly, rather than searching lists of n-grams. Is there a way to do that? Alternatively, is there a way to associate the n-gram lists with their positions in the original text?
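One possible way to keep that association (a sketch, using a hypothetical tokenize_with_spans helper built on re.finditer; NLTK's TreebankWordTokenizer.span_tokenize provides the same kind of offsets): make each token carry its character offsets, so that an n-gram's location in the original file is simply the start of its first token and the end of its last.

```python
import re

def tokenize_with_spans(text):
    """Tokenize, keeping each token's (start, end) character offsets
    so matches can be traced back to the original text."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r'\w+|[^\w\s]', text)]

text = "In the beginning God created the heaven and the earth."
spans = tokenize_with_spans(text)

# An n-gram over tokens 3..5 maps straight back to a character range,
# with no fuzzy re-searching required.
gram = spans[3:6]
start, end = gram[0][1], gram[-1][2]
print(text[start:end])  # -> "God created the"
```

With offsets like these, findInText() and its max_l_dist search could be dropped entirely: the context window is just text[start - 20:end + 20].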

0 Answers:

No answers