I have a set of concatenated words, and I want to split each one into an array of its component words.
For example:
split_word("acquirecustomerdata")
=> ['acquire', 'customer', 'data']
I found pyenchant, but it is not available for 64-bit Windows.
Then I tried splitting each string into substrings and comparing them against WordNet to find matching words.
But this approach is unreliable and takes too long, so I am asking for your help.
Thanks.
Answer 0 (score: 5)
Take a look at the Word Segmentation Task from Norvig's work.
# Python 3
from collections import Counter
from functools import lru_cache

import nltk

# Requires the Brown corpus: run nltk.download('brown') once beforehand.
WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x] / N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L) + 1)]

# Memoize so that each suffix is segmented only once; without this the
# recursion revisits the same suffixes an exponential number of times.
@lru_cache(maxsize=None)
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    else:
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print(segment('acquirecustomerdata'))
#['acquire', 'customer', 'data']
For a better solution, you can use bigrams or trigrams.
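As a rough sketch of the bigram idea, the version below scores each word given the previous one and backs off to the unigram estimate when that pair was never seen. The three-sentence corpus, the names `p_unigram`, `p_bigram` and `segment2`, the `<S>` start marker, and the length penalty for unknown words are all assumptions made for illustration; a real run would count bigrams over a large corpus such as Brown, and the backoff here is not a properly normalized distribution.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy corpus, for illustration only.
SENTENCES = [
    "we acquire customer data",
    "the customer data is large",
    "we acquire new data",
]

UNIGRAMS = Counter()
BIGRAMS = Counter()
for sentence in SENTENCES:
    words = ["<S>"] + sentence.split()   # <S> marks the sentence start
    UNIGRAMS.update(words)
    BIGRAMS.update(zip(words, words[1:]))

N = sum(UNIGRAMS.values())
MAX_LEN = 20  # longest candidate word considered


def p_unigram(word):
    "P(word); unknown words get a small probability that shrinks with length."
    if word in UNIGRAMS:
        return UNIGRAMS[word] / N
    return 1.0 / (N * 10 ** len(word))


def p_bigram(word, prev):
    "P(word | prev), backing off to the unigram estimate for unseen pairs."
    if (prev, word) in BIGRAMS:
        return BIGRAMS[prev, word] / UNIGRAMS[prev]
    return p_unigram(word)


@lru_cache(maxsize=None)
def segment2(text, prev="<S>"):
    "Return (log probability, words) for the best segmentation of text."
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_LEN) + 1):
        first, rest = text[:i], text[i:]
        rest_logp, rest_words = segment2(rest, first)
        logp = math.log(p_bigram(first, prev)) + rest_logp
        candidates.append((logp, (first,) + rest_words))
    return max(candidates)


logp, words = segment2("acquirecustomerdata")
print(list(words))
```

Working in log probabilities avoids underflow on long inputs, and caching on the `(text, prev)` pair keeps the search polynomial rather than exponential.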
Answer 1 (score: 0)
If you have a list of all possible words, you can use something like this:
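import re
word_list = ["go", "walk", "run", "jump"]  # list of all possible words
# re.escape keeps any regex metacharacters in the words literal
pattern = re.compile("|".join(re.escape(word) for word in word_list))
s = "gowalkrunjump"
result = pattern.findall(s)
print(result)
# ['go', 'walk', 'run', 'jump']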
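Two things are worth watching with the alternation approach: the regex engine takes the first alternative that matches, so a short word listed before a longer one that starts with it will win, and `findall` silently skips any characters it cannot match. The sketch below, with a hypothetical helper name `split_known_words`, sorts the words longest first and returns `None` when the matches do not cover the whole input. It is still greedy, so it can miss a valid split that needs a shorter word at some position.

```python
import re


def split_known_words(text, word_list):
    "Split text into known words, or return None if some part is not covered."
    # Longest words first, so "goal" is tried before "go".
    ordered = sorted({w for w in word_list if w}, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    parts = pattern.findall(text)
    # findall skips unmatched characters, so check that nothing was dropped.
    if "".join(parts) != text:
        return None
    return parts


print(split_known_words("gowalkrunjump", ["go", "walk", "run", "jump"]))
print(split_known_words("goals", ["go", "goal", "s"]))
print(split_known_words("gowalkxrun", ["go", "walk", "run"]))
```

When the greedy split fails on inputs that do have a valid segmentation, a probabilistic or backtracking approach is the more reliable choice.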