How to split two joined words

Time: 2017-07-11 18:23:09

Tags: python-3.x nlp tokenize

I have a dataset of reviews that I want to process with NLP techniques. I have already done all the preprocessing steps (stop-word removal, stemming, etc.). My problem is that some words are joined together and my function does not recognize them. Here is an example:

Great services. I had a nicemeal and I love it a lot. 

How can I correct nicemeal to nice meal?

1 Answer:

Answer 0 (score: 2)

Peter Norvig has a very good solution to the word-segmentation problem you are running into. In short, he uses a large dataset of word (and bigram) frequencies plus a bit of dynamic programming to split a long string of concatenated words into its most probable segmentation.

You can download the zip file with the source code and the word-frequency data and adapt it to your use case. Here are the relevant bits, for completeness.

import operator
from functools import reduce

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.values()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in open(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
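
As a quick illustration (assuming count_1w.txt from the zip file sits next to your script), segmenting the run-on word from your question would look like this; the expected outputs are what the unigram model should produce:

# Hypothetical usage: segment the run-on words from the question.
# Pw and segment are the objects defined above.
print(segment('nicemeal'))        # expected: ['nice', 'meal']
print(segment('greatservices'))   # expected: ['great', 'services']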

You can also use the segment2 method, since it uses bigrams and is much more accurate.
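
For reference, here is a rough sketch of how such a bigram-based segment2 can be built on top of the pieces above, adapted to Python 3; P2w, cPw, combine, and the count_2w.txt bigram file come from the same zip, and the exact code there may differ slightly:

from math import log10

# Bigram counts from the same zip file (assumed filename: count_2w.txt).
P2w = Pdist(datafile('count_2w.txt'), N)

def cPw(word, prev):
    "Conditional probability of word, given the previous word."
    try:
        return P2w[prev + ' ' + word] / float(Pw[prev])
    except KeyError:
        return Pw(word)

def combine(Pfirst, first, Prest_and_rest):
    "Combine first word with the (log-probability, words) pair for the rest."
    Prest, rest = Prest_and_rest
    return Pfirst + Prest, [first] + rest

@memo
def segment2(text, prev='<S>'):
    "Return (log P(words), words), where words is the best segmentation."
    if not text:
        return 0.0, []
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
                  for first, rem in splits(text)]
    return max(candidates)

Calling segment2('nicemeal')[1] then returns the word list; because each word is scored conditioned on the previous one, it tends to beat the unigram segment on longer strings.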