How to get words 3 edits away in a spell checker (Norvig)

Asked: 2017-06-05 11:36:12

Tags: python sql-server nlp spell-checking spelling

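I have been trying to use a spelling corrector on my database tables to correct the addresses in one table, working from the code at http://norvig.com/spell-correct.html. I use the Address_mast table as the collection of known strings, and I am trying to correct the strings in "customer_master" and update that table with the corrected versions.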

Address_mast

ID        Address
1    sonal plaza,harley road,sw-309012
2    rose apartment,kell road, juniper, la-293889
3    plot 16, queen's tower, subbden - 399081
4    cognizant plaza, abs road, ziggar - 500234

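The reference code only handles words that are up to two edits away from a known word. I am trying to extend that to 3 or 4 edits and, at the same time, write the corrected words back to the other table. That table contains the misspelled words and should be updated with the corrected ones.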

Customer_master

Address_1

josely apartmt,kell road, juneeper, la-293889
zoonal plaza, harli road,sw-309012
plot 16, queen's tower, subbden - 399081
cognejantt pluza, abs road, triggar - 500234

Here is what I have tried:

import re
import pyodbc
import numpy as np
from collections import Counter

cnxn = pyodbc.connect('DRIVER={SQLServer};SERVER=localhost;DATABASE=DBM;UID=ADMIN;PWD=s@123;autocommit=True')
cursor = cnxn.cursor()
cursor.execute("select address as data  from Address_mast")
data=[]
for row in cursor.fetchall():

    data.append(row[0]) 

data = np.array(data)

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('data').read()))
def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or known(edits3(word)) or known(edits4(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def edits3(word): 

    return (e3 for e2 in edits2(word) for e3 in edits1(e2))

def edits4(word): 

    return (e4 for e3 in edits3(word) for e4 in edits1(e3))


sqlstr = ""
j=0
k=0
for i in data:
    sqlstr=" update customer_master set Address='"+correction(data)+"' where data="+correction(data)
    cursor.execute(sqlstr)

    j=j+1
    k=k+cursor.rowcount
cnxn.commit()
cursor.close()
cnxn.close()
print(str(k) +" Records Completed")

I cannot get the correct output from this. Any suggestions on what to change? Thanks in advance.

2 Answers:

Answer 0: (score: 0)

We can take the existing set of one-edit words and apply one more edit to every member of that set.

Algorithm: compute One_Edit_Words = edits1(word), then for each word in One_Edit_Words apply edits1 again.

def edit2(word):
    new = edits1(word)           # set of all strings one edit away
    for i in edits1(word):       # iterate over every one-edit string
        new.update(edits1(i))    # add each of its one-edit variants
    return new                   # set of all strings within two edits

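In the same way, we can build a function for any number of edits. The function below gives you three edits:

def edit3(word):
    new = edit2(word)
    for i in edit2(word):
        new.update(edits1(i))
    return new

Warning: this takes a very long time even for short words (the time complexity is very high).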
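
The same pattern can be written once for any number of edits. The following is only a sketch: `edits_k` is a hypothetical helper name, not part of Norvig's code, and `edits1` is repeated so the block runs on its own. It expands only the strings that were new in the previous round, which avoids re-editing strings that have already been seen, but the result set still grows very quickly with `k`.

```python
def edits1(word):
    """All strings one edit away from `word` (as in Norvig's corrector)."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits_k(word, k):
    """All strings reachable from `word` with at most `k` edits (hypothetical helper)."""
    seen = {word}
    frontier = {word}
    for _ in range(k):
        # Expand only the strings discovered in the previous round.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return seen


# 'juneeper' -> 'juneper' (delete) -> 'juniper' (replace): two edits.
print('juniper' in edits_k('juneeper', 2))
```

In `candidates`, `known(edits_k(word, 3))` could then stand in for a separate `edits3`, with the same caveat about running time and memory.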

Answer 1: (score: 0)

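The answer above works, but there is a faster solution than checking the exponentially growing set of strings at edit distance k. Suppose we store the whole vocabulary in a tree structure (a trie). This is useful because, for example, we know we never need to follow paths that lead to no word. That makes the search both memory-efficient and computationally efficient.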

Suppose the vocabulary is stored in a set, a dict, or ideally a collections.Counter object. The data structure can then be set up as follows:

class VocabTreeNode:
    def __init__(self):
        self.children = {}
        self.word = None
        
    def build(self, vocab):
        for w in vocab:
            self.insert(w)

    def insert( self, word):
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = VocabTreeNode()
            node = node.children[letter]
        node.word = word

To find only the vocabulary words within edit distance k of a given word, the structure can be searched recursively:

    def search(self, word, maxCost):
        currentRow = range( len(word) + 1 )    
        results = []
        for letter in self.children:
            self.searchRecursive(self.children[letter], letter, 
                                 word, currentRow, results, 
                                 maxCost)   
        return results
            
    def searchRecursive(self, node, letter, word, previousRow, 
                        results, maxCost):
        columns = len( word ) + 1
        currentRow = [ previousRow[0] + 1 ]
        for column in range( 1, columns ):
            insertCost = currentRow[column - 1] + 1
            deleteCost = previousRow[column] + 1
            if word[column - 1] != letter:
                replaceCost = previousRow[ column - 1 ] + 1
            else:                
                replaceCost = previousRow[ column - 1 ]
            currentRow.append( min( insertCost, deleteCost, replaceCost ) )
    
        if currentRow[-1] <= maxCost and node.word is not None:
            results.append( (node.word, currentRow[-1] ) )
        if min( currentRow ) <= maxCost:
            for next_letter in node.children:
                self.searchRecursive( node.children[next_letter], next_letter, word,
                                      currentRow, results, maxCost)

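There is one issue I am not sure how to overcome. A transposition does not correspond to a path in the tree, so I am not sure how to count a transposition as a single edit without some slightly complicated trick.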
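
One possible trick, offered only as a sketch and not as part of the code above, is to carry the row before the previous one through the recursion and apply the "optimal string alignment" variant of Damerau-Levenshtein distance, which counts an adjacent swap as one edit. The names `Node`, `build_trie`, `search`, and the `transpositions` flag are all assumptions made for this illustration. The block restates a small trie so that it runs on its own.

```python
class Node:
    """Trie node: child nodes keyed by letter, plus the word ending here (if any)."""
    def __init__(self):
        self.children = {}
        self.word = None


def build_trie(vocab):
    root = Node()
    for w in vocab:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.word = w
    return root


def search(root, word, max_cost, transpositions=True):
    """Return (vocab_word, distance) pairs within max_cost of `word`."""
    results = []
    cols = len(word) + 1

    def walk(node, letter, prev_letter, prev_row, prev_prev_row):
        row = [prev_row[0] + 1]
        for col in range(1, cols):
            cost = min(
                row[col - 1] + 1,                               # insertion
                prev_row[col] + 1,                              # deletion
                prev_row[col - 1] + (word[col - 1] != letter),  # match / replace
            )
            # Adjacent swap: the last two trie letters equal the last two
            # query letters in reverse order.
            if (transpositions and prev_prev_row is not None and col > 1
                    and letter == word[col - 2] and prev_letter == word[col - 1]):
                cost = min(cost, prev_prev_row[col - 2] + 1)
            row.append(cost)
        if node.word is not None and row[-1] <= max_cost:
            results.append((node.word, row[-1]))
        if min(row) <= max_cost:  # otherwise no extension can get back under the limit
            for nxt, child in node.children.items():
                walk(child, nxt, letter, row, prev_row)

    first_row = list(range(cols))
    for letter, child in root.children.items():
        walk(child, letter, None, first_row, None)
    return results


trie = build_trie(['plaza', 'road', 'juniper', 'harley'])
print(search(trie, 'palza', 1))                        # [('plaza', 1)]
print(search(trie, 'palza', 1, transpositions=False))  # []
```

With the flag off, this is the plain Levenshtein search, where the swapped "al" in "palza" costs two edits. Note that this variant does not allow further edits between the two swapped letters, so it is not identical to chaining `edits1`.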

My vocabulary has 97,722 words (the word list that ships with almost every Linux distribution).

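from time import sleep, time

# V is assumed to be a VocabTreeNode already built from the word list via V.build(...)
sleep(1)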
start = time()

for i in range(100):
    x = V.search('elephant',3)
    
print(time()- start)

>>> 17.5 

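That works out to 0.175 seconds per edit-distance-3 search for this word. Edit distance 4 can be done in 0.377 seconds, whereas chaining edits1 to reach the same distance will quickly make the system run out of memory.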
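
As a rough illustration of why chaining edits1 runs out of memory, here is some back-of-the-envelope arithmetic (not a measurement). For a word of length n, edits1 builds n deletes, n-1 transposes, 26n replaces, and 26(n+1) inserts, which is 54n+25 strings before de-duplication. The helper name `edits1_count` is hypothetical, and the estimate ignores both duplicates and the fact that word lengths change between rounds.

```python
def edits1_count(n, alphabet=26):
    """Strings generated by edits1 for a word of length n, before de-duplication."""
    return n + (n - 1) + alphabet * n + alphabet * (n + 1)


per_round = edits1_count(len('elephant'))  # 457 for an 8-letter word

# Crude estimate of how many strings k chained rounds generate.
for k in range(1, 5):
    print(k, per_round ** k)
```

By this estimate, three rounds already generate on the order of 10^8 strings, while the tree search only ever holds one row per level of the current path.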

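With the caveat that transpositions are not easily handled, this is a fast and effective way to implement a Norvig-type algorithm for high edit distances.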