Question

I have a set of concatenated words, and I want to split each one into an array of its component words.

For example:


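split_word("acquirecustomerdata") => ['acquire', 'customer', 'data']

I found pyenchant, but it is not available for 64-bit Windows.

Then I tried splitting each string into substrings and comparing them against WordNet to find an equivalent word.

But this solution is unreliable and too long, so I am looking for your help.

Thanks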

Answer 1

Take a look at the Word Segmentation Task from Norvig's work.

from __future__ import division
from collections import Counter
import re, nltk

# Requires the Brown corpus; download it once with nltk.download('brown').
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together.  (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:]) 
            for i in range(start, min(len(text), L)+1)]

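from functools import lru_cache

# Memoize: without caching, the recursion re-solves the same suffixes
# over and over, which is exponential in the length of the text.
@lru_cache(maxsize=None)
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    else:
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)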

print(segment('acquirecustomerdata'))
# ['acquire', 'customer', 'data']

For better results, you can use bigrams or trigrams.
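To make the bigram idea concrete, here is a minimal sketch; it is not Norvig's code. It assumes a tiny made-up training text (`CORPUS`) so that it runs without downloading anything, and the names `log_p1`, `log_p2`, `segment2` and the `"<S>"` start marker are hypothetical. Each word is scored by how often it follows the previous word, falling back to its plain unigram frequency when that pair was never seen. Log-probabilities are summed instead of multiplying raw probabilities, and unseen words get a penalty that grows with their length. The fallback is deliberately crude and is not a properly normalized distribution.

```python
from collections import Counter
from functools import lru_cache
import math

# Toy training text (an assumption for illustration; use a real corpus in practice).
CORPUS = ("we acquire customer data . the customer data is stored . "
          "we acquire new data").split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
N = len(CORPUS)

def log_p1(word):
    "Unigram log-probability, with a length-based penalty for unseen words."
    if word in UNIGRAMS:
        return math.log(UNIGRAMS[word] / N)
    return math.log(1.0 / (N * 10 ** len(word)))

def log_p2(word, prev):
    "log P(word | prev); fall back to the unigram score if the pair is unseen."
    if (prev, word) in BIGRAMS:
        return math.log(BIGRAMS[prev, word] / UNIGRAMS[prev])
    return log_p1(word)

@lru_cache(maxsize=None)
def segment2(text, prev="<S>"):
    "Return (log-probability, words) for the best segmentation of text."
    if not text:
        return 0.0, ()
    best = None
    for i in range(1, min(len(text), 20) + 1):
        first, rest = text[:i], text[i:]
        rest_logp, rest_words = segment2(rest, first)
        candidate = (log_p2(first, prev) + rest_logp, (first,) + rest_words)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best

print(list(segment2("acquirecustomerdata")[1]))
# ['acquire', 'customer', 'data']
```

With a real corpus you would build `UNIGRAMS` and `BIGRAMS` from its word list instead of the toy text; the rest of the sketch stays the same.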

More examples: Word Segmentation Task

Answer 2

If you have a list of all possible words, you can use something like this:

import re

word_list = ["go", "walk", "run", "jump"]  # list of all possible words
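# Escape each word, and try longer words first so a short word cannot shadow a longer one.
pattern = re.compile("|".join(re.escape(word) for word in sorted(word_list, key=len, reverse=True)))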

s = "gowalkrunjump"
result = re.findall(pattern, s)
# ['go', 'walk', 'run', 'jump']
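
One thing to watch with the regex approach: `re.findall` silently skips any characters that match no word, and it never reconsiders an earlier match when a later part of the string fails. If that matters for your data, a small backtracking search over the same word list may be safer. The following is a sketch under that assumption; `split_words` is a hypothetical helper name, not part of any library. It returns a list of words that covers the whole string exactly, or `None` when no such split exists.

```python
from functools import lru_cache

def split_words(text, word_list):
    """Return a list of dictionary words that exactly covers text, or None."""
    words = frozenset(word_list)
    max_len = max(map(len, words), default=0)

    @lru_cache(maxsize=None)
    def solve(start):
        # Segment text[start:]; returns a tuple of words, or None if impossible.
        if start == len(text):
            return ()
        # Try the longest candidate first, then fall back to shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            word = text[start:end]
            if word in words:
                rest = solve(end)
                if rest is not None:
                    return (word,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)

print(split_words("gowalkrunjump", ["go", "walk", "run", "jump"]))
# ['go', 'walk', 'run', 'jump']
print(split_words("pensword", ["pen", "pens", "sword"]))
# ['pen', 'sword']
print(split_words("gowalkxyz", ["go", "walk"]))
# None
```

In the second call a longest-first regex would match "pens" and then drop "word" without any warning, whereas the search backs up and finds "pen" + "sword".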
