I'm trying to find a way to split words in Python using the nltk module. I'm not sure how to reach my goal, because my raw data is a list of tokenized words, for example:
['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
As you can see, many of the words are stuck together (i.e. 'to' and 'produce' are stuck together in the single string 'toproduce'). This is an artifact of scraping the data from a PDF file, and I would like to find a way, using the nltk module in Python, to split the stuck-together words (i.e. split 'toproduce' into the two words 'to' and 'produce'; split 'standardoperatingprocedures' into the three words 'standard', 'operating', 'procedures').
I appreciate any help!
Answer 0 (score: 5)
I believe you want to use word segmentation in this case, and I am not aware of any word segmentation functionality in NLTK that handles English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It works for a modest number of relatively short strings, such as the strings in your example list, but would be very inefficient for longer strings or many more strings.) It would need modification, and it will not successfully segment every string in any case.
import enchant  # pip install pyenchant

eng_dict = enchant.Dict("en_US")

def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
                    (These have been found previously to lead to dead ends.)
                    If an excluded word occurs later in the string, this
                    function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words
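For example, calling it on one of the glued-together strings from your question (assuming the en_US dictionary is installed) should give:

>>> segment_str('standardoperatingprocedures')
['standard', 'operating', 'procedures']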
As you can see, this approach will (probably) mis-segment only one of your strings:
>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework']
>>> [segment_str(chars) for chars in t]
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']]
You can then flatten this list of lists using chain:
>>> from itertools import chain
>>> list(chain.from_iterable(segment_str(chars) for chars in t))
"genotypes" not in dictionary (so just keeping as one segment)!
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework']
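If you need to run this over longer strings or many more of them, note that the greedy search with backtracking tests many of the same candidate segments against the dictionary repeatedly. One way to cut that cost (just a sketch of the idea, with an illustrative helper name) is to memoize the dictionary lookups with functools.lru_cache and call the cached helper inside segment_str in place of eng_dict.check:

from functools import lru_cache

@lru_cache(maxsize=None)
def is_english_word(segment):
    # memoize dictionary lookups; the same candidate segments are tested
    # repeatedly during the greedy search and its backtracking restarts
    return eng_dict.check(segment)

Inside the for loop you would then test is_english_word(segment) and segment not in exclude instead of calling eng_dict.check(segment) directly.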