Maybe what I mean isn't so much "double hits" as "approximate string matching" or "fuzzy text matching"? Basically I want a program like grep, except that it searches for phrases of length X that appear in roughly the same form in both texts, and returns those phrases along with their original context. This is what I have so far. (It's also in this ipython notebook, with sample output.)
import nltk
import re
from nltk.util import ngrams
from difflib import SequenceMatcher
from string import punctuation
from termcolor import colored
from fuzzysearch import find_near_matches


class Matcher:
    def __init__(self, fileA, fileB, threshold, ngramSize):
        """
        Gets the texts from the files, tokenizes them,
        cleans them up as necessary.
        """
        self.threshold = threshold
        self.filenameA = fileA
        self.filenameB = fileB
        self.textA = self.readFile(fileA)
        self.textB = self.readFile(fileB)
        textATokens = self.tokenize(self.textA)
        textBTokens = self.tokenize(self.textB)
        self.textAgrams = list(ngrams(textATokens, ngramSize))
        self.textBgrams = list(ngrams(textBTokens, ngramSize))

    def readFile(self, filename):
        """ Reads the file into memory. """
        return open(filename).read()

    def tokenize(self, text):
        """ Tokenizes the text, breaking it up into words. """
        return nltk.word_tokenize(text.lower())

    def gramsToString(self, grams):
        """
        Takes a list of tuples (3-grams, 4-grams, etc.)
        and stitches it back together into a string, so that
        we can search the non-tokenized text for the string later.
        """
        string = " ".join(grams[0][:-1])
        for gram in grams:
            lastGram = gram[-1]
            if lastGram not in punctuation:
                string += " " + lastGram
            else:
                string += lastGram
        return string

    def getMatch(self, match, textA, textB):
        """
        Takes the match object returned by get_matching_blocks() and
        gets the matched n-grams. It uses gramsToString() to
        reformat this into a string.
        """
        textAs, textBs = [], []
        for i in range(match.size):
            textAs.append(textA[match.a + i])
            textBs.append(textB[match.b + i])
        return (self.gramsToString(textAs), self.gramsToString(textBs))

    def match(self):
        """
        This does the main work of finding matching n-gram sequences between
        the texts.
        """
        sequence = SequenceMatcher(None, self.textAgrams, self.textBgrams)
        matchingBlocks = sequence.get_matching_blocks()

        # Only return the matching sequences that are longer than the
        # threshold given by the user.
        highMatchingBlocks = [match for match in matchingBlocks if match.size > self.threshold]

        for match in highMatchingBlocks:
            out = self.getMatch(match, self.textAgrams, self.textBgrams)
            print('\n', out)
            self.findInText(out[0], self.textA, self.filenameA, 20)
            self.findInText(out[1], self.textB, self.filenameB, 20)

    def findInText(self, needle, haystack, haystackName, context):
        """
        This takes the matches found by match() and tries to find that match
        again in the text, so that we can return some context. Uses the
        fuzzysearch library, because I couldn't find anything better.
        """
        m = find_near_matches(needle, haystack, max_l_dist=2)
        if len(m) > 0:
            m = m[0]  # just get the first match for now. TODO: get all of them
            start = max(0, m.start - context)  # don't let the slice go negative
            before = haystack[start:m.start]
            match = colored(haystack[m.start:m.end], 'red')
            after = haystack[m.end:m.end + context]
            contextualized = before + match + after
            cleaned = re.sub(r'\s+', ' ', contextualized).strip()
            print(colored(haystackName, 'green') + ": " + cleaned)
        else:
            print('Couldn\'t find this match in file: ', haystackName)
Usage:
myMatch = Matcher('milton.txt', 'kjv.txt', 2, 3)
myMatch.match()
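To show what those two steps actually do, here's a tiny self-contained example (made-up strings rather than milton.txt and kjv.txt, but the same library calls):

from difflib import SequenceMatcher
from nltk.util import ngrams
from fuzzysearch import find_near_matches

tokensA = "in the beginning god created the heaven and the earth".split()
tokensB = "in the beginning was the word and the word was god".split()
gramsA = list(ngrams(tokensA, 3))
gramsB = list(ngrams(tokensB, 3))

# Step 1: difflib finds runs of identical n-grams between the two lists.
for block in SequenceMatcher(None, gramsA, gramsB).get_matching_blocks():
    if block.size > 0:
        print(block, gramsA[block.a:block.a + block.size])
# -> Match(a=0, b=0, size=1) [('in', 'the', 'beginning')]

# Step 2: the stitched-together phrase is searched for again, fuzzily, in the
# raw (untokenized) text to recover its position and some surrounding context.
print(find_near_matches("in the beginning", "In the beginning God created the heaven".lower(),
                        max_l_dist=2))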
The script mostly works, but it's awkward, because it has to 1) find a matching string using the n-grams, and then 2) figure out where in the texts those n-grams came from. Sometimes the program finds a match but can't locate it again in the original text. The whole thing could be simplified if there were a way to search the files themselves directly, rather than searching lists of n-grams. Is there a way to do that? Or, alternatively, is there a way to associate the n-gram lists with their positions in the original text?
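To make that last question concrete, here is a rough sketch of the kind of thing I'm imagining (tokenize_with_spans and find_matches are just placeholder names, and the \w+ regex is a crude stand-in for nltk.word_tokenize): every token keeps its character offsets, the n-grams are built over those tokens, and a matching block from SequenceMatcher then maps straight back to a character span in the original file, with no second fuzzy search.

import re
from difflib import SequenceMatcher

def tokenize_with_spans(text):
    """Return lowercased tokens plus the (start, end) character offsets of each."""
    tokens, spans = [], []
    for m in re.finditer(r"\w+", text):  # crude stand-in for nltk.word_tokenize
        tokens.append(m.group().lower())
        spans.append((m.start(), m.end()))
    return tokens, spans

def find_matches(textA, textB, ngramSize=3, threshold=2, context=20):
    tokensA, spansA = tokenize_with_spans(textA)
    tokensB, spansB = tokenize_with_spans(textB)
    # n-gram i starts at token i, so a matching block of n-grams maps directly
    # back to a run of tokens -- and, through the span lists, to character offsets.
    gramsA = [tuple(tokensA[i:i + ngramSize]) for i in range(len(tokensA) - ngramSize + 1)]
    gramsB = [tuple(tokensB[i:i + ngramSize]) for i in range(len(tokensB) - ngramSize + 1)]
    for block in SequenceMatcher(None, gramsA, gramsB).get_matching_blocks():
        if block.size > threshold:
            lastTokenA = block.a + block.size + ngramSize - 2
            lastTokenB = block.b + block.size + ngramSize - 2
            startA, endA = spansA[block.a][0], spansA[lastTokenA][1]
            startB, endB = spansB[block.b][0], spansB[lastTokenB][1]
            yield (textA[max(0, startA - context):endA + context],
                   textB[max(0, startB - context):endB + context])

If something along these lines holds up, gramsToString() and the fuzzysearch lookup become unnecessary, since the character offsets of every match are known exactly; I'm just not sure how to do the same thing while keeping nltk's tokenizer.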