I am parsing German text that contains a lot of hyphenated words. To decide whether such a word is a proper German word that was merely split by a hyphen at the end of a line, or whether it genuinely needs the hyphen because that is how it is written, I am currently extending a collection of lemmatized words that I found here: https://github.com/michmech/lemmatization-lists
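For reference, this is roughly how I load such a list into memory. This is a simplified sketch: it assumes UTF-8 files with two tab-separated columns per line (lemma and inflected form, in whichever order), and the file name `lemmatization-de.txt` is just an example, not a fixed part of that repository's layout:

```python
# Hedged sketch: load one of the lemmatization lists into a set of known
# spellings. Assumes UTF-8 and "column<TAB>column" lines; since both the
# lemma and the inflected form are added, the column order does not matter.
def load_word_list(path='lemmatization-de.txt'):
    words = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            words.update(p for p in parts if p)
    return words
```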
Could you point me to a way to do this with nltk?
What I do: when my parser encounters a hyphenated word, I check the spelling without the hyphen (i.e. whether the word, with the hyphen removed, is contained in my list). If it is not in my list (currently about 420,000 words), I decide manually whether it should be added to the list as one word or kept with its hyphen.
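Stripped to its core, the check is the following (a sketch only; `merge_or_keep` and the example tokens are made up, and `german_words` stands for my in-memory collection of known spellings):

```python
# Hedged sketch of the core decision, not the full function below.
def merge_or_keep(left, right, german_words):
    candidate = left.rstrip('-') + right
    if candidate in german_words:
        return candidate      # e.g. "Zei-" + "tung" -> "Zeitung": a line-break hyphen
    return left + right       # keep the hyphen: it is part of the spelling
```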
This is the function that does the work:
```python
import re

def function(sents, german_words, hyphened_words):
    """Re-join words that were split by a line-break hyphen.

    `german_stopwords` and `_is_url_or_mail_address` are defined elsewhere.
    Cleaned forms (punctuation stripped) are used for the lookups only, so
    the rebuilt sentences keep their punctuation.
    """
    clutter = r'[*!?,.;:_\s()\u201C\u201D\u201E\u201F\u2033\u2036\u0022]'
    sentences = []
    new_hyphened_words = []
    new_german_words = []
    skip = False
    for i, sentence in enumerate(sents):
        if skip:  # this sentence was already merged into the previous one
            skip = False
            continue
        words = list(filter(None, sentence.split(' ')))
        if not words:  # blank line: keep it as-is
            sentences.append(sentence)
            continue
        new_words = words  # default: keep the sentence unchanged
        last_word = words[-1].strip()
        if last_word.endswith('-'):
            try:
                next_sentence = sents[i + 1]
            except IndexError:
                raise  # a trailing hyphen in the very last sentence cannot be resolved
            next_words = list(filter(None, next_sentence.split(' ')))
            first_word = next_words[0]
            joined = last_word[:-1] + first_word     # hyphen removed
            hyphened = last_word + first_word        # hyphen kept
            joined_clean = re.sub(clutter, '', joined)
            hyphened_clean = re.sub(clutter, '', hyphened)
            if (_is_url_or_mail_address(joined_clean)
                    or joined_clean in german_stopwords
                    or joined_clean in german_words):
                new_words = words[:-1] + [joined] + next_words[1:]
            elif hyphened_clean in hyphened_words:
                new_words = words[:-1] + [hyphened] + next_words[1:]
            else:  # found neither with nor without hyphen: ask the user
                print(f'1: {hyphened_clean}, 2: {joined_clean}')
                choose = input('1 or 2, or . if correction: ')
                if choose == '1':
                    new_hyphened_words.append(hyphened_clean)
                    new_words = words[:-1] + [hyphened] + next_words[1:]
                elif choose == '2':
                    new_german_words.append(joined_clean)
                    new_words = words[:-1] + [joined] + next_words[1:]
                else:
                    corrected_word = input('Corrected word: ')
                    print()
                    new_german_words.append(corrected_word)
                    print(f'Added to dict: "{corrected_word}"')
                    ok = input('Also add to speech? ./n ')
                    if ok == 'n':
                        speech_word = input('Speech word: ')
                        new_words = words[:-1] + [speech_word] + next_words[1:]
                    else:
                        new_words = words[:-1] + [corrected_word] + next_words[1:]
            skip = True  # the next sentence was consumed by the merge
        sentences.append(' '.join(new_words))
    return sentences
```
The lists `german_words` and `hyphened_words` are updated from time to time, so they contain the new words collected in previous sessions.
What I do works, but it is slow. I have been looking for a way to do this with nltk, but I seem to be looking in the wrong places. Could you point me to a way to train a collection of words with nltk, or to a more efficient way of handling this?
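One note on the speed part of the question: the membership tests in the function above (`joined_clean in german_words` and the like) run against plain Python lists, so every lookup scans up to ~420,000 entries. A minimal sketch of the same lookups backed by sets, assuming the lists fit in memory:

```python
# Hedged sketch: `word in list` is O(n) over ~420,000 entries per lookup,
# while `word in set` is a constant-time hash lookup with the same semantics.
german_words = set(german_words)
hyphened_words = set(hyphened_words)
german_stopwords = set(german_stopwords)
sentences = function(sents, german_words, hyphened_words)
```

This keeps the logic unchanged; only the container type differs.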