Question

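I have a French text containing words that have been split by a space (e.g. république *). I want to remove these split words from the text and append them to a list, while keeping the punctuation and digits in the text. My code works for collecting the split words, but it fails to keep the digits in the text.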

import nltk
from nltk.tokenize import word_tokenize

import re

with open ('french_text.txt') as tx: 
#opening text containing the separated words
    #stores the text with the separated words
    text = word_tokenize(tx.read().lower()) 


with open ('Fr-dictionary.txt') as fr:  #opens the dictionary
    dic = word_tokenize(fr.read().lower()) #stores the first dictionary

pat=re.compile(r'[.?\-",:]+|\d+')

out_file=open("newtext.txt","w") #defining name of output file
valid_words=[ ] #empty list to append the words checked by the dictionary 
invalid_words=[ ] #empty list to append the errors found

for word in text:
    reg=pat.findall(word)
    if reg is True:
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word)#appending to a list the words checked 
    else:
        invalid_words.append(word) #appending the invalid_words



a=' '.join(valid_words) #converting list into a string

print(a) #print converted list
print(invalid_words) #print errors found

out_file.write(a) #writing the output to a file

out_file.close()

So with this code, my list of errors comes out with the digits in it.

['ments', 'prési', 'répu', 'blique', 'diri', 'geants', '»', 'grand-est', 'elysée', 'emmanuel', 'macron', 'sncf', 'pepy', 'montparnasse', '1er', '2017.', 'geoffroy', 'hasselt', 'afp', 's', 'empare', 'sncf', 'grand-est', '26', 'elysée', 'emmanuel', 'macron', 'sncf', 'saint-dié', 'epinal', '23', '2018', 'etat', 's', 'vosges', '2018']

I think the problem is with the regular expression. Any suggestions? Thank you!!

Answer 1

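The problem is the if statement that checks reg is True. pat.findall(word) returns a list, and a list is never the object True, so that branch is never taken. You should not use the is operator with True to check whether the result of pat.findall(word) was positive (i.e. you had a matching word).

You can test the truthiness of the result instead:

if pat.findall(word):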
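
To see why the reg is True branch never runs, here is a small sketch using the pattern from the question. findall returns a list, which is falsy when empty and truthy otherwise, but is never identical to True. The fullmatch check at the end is an assumption about the intent (keep a token only when it consists entirely of punctuation or digits); the question does not state this.

```python
import re

# Pattern copied from the question: runs of punctuation, or runs of digits.
pat = re.compile(r'[.?\-",:]+|\d+')

for word in ["2018", "blique", "grand-est"]:
    reg = pat.findall(word)
    # findall returns a list, so the identity test against True is always False.
    print(word, reg, reg is True, bool(reg))

# Assumption: only tokens made up entirely of punctuation or digits should be kept.
# fullmatch avoids accepting "grand-est" just because it contains a hyphen.
print(bool(pat.fullmatch("2018")), bool(pat.fullmatch("grand-est")))
```

Note that a plain truthiness test on findall also accepts any token that merely contains a hyphen or a digit, which may or may not be what you want.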

Answer 2

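Caveat user: this is actually a complex problem, because it all depends on what we define to be a word:

Is l’Académie a single word? What about j’eus?
Is gallo-romanes one word, or c'est-à-dire?
How about J.-C.?
And xiv(e) (with a superscript, as in 14 siecle)?
And then QDN or QQ1 or LOL?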
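
These cases can be made concrete with a short sketch. Both patterns below are illustrative assumptions, not a recommended definition of a French word: SIMPLE is the plain \w+ pattern, and JOINED (a hypothetical name) additionally lets apostrophes and hyphens join runs of word characters.

```python
import re

SIMPLE = re.compile(r"\w+")
# Hypothetical alternative: word characters optionally joined by ' or ’ or -
JOINED = re.compile(r"\w+(?:['’\-]\w+)*")

for sample in ["l’Académie", "c'est-à-dire", "J.-C."]:
    # Print how each definition of "word" splits the same input.
    print(sample, SIMPLE.findall(sample), JOINED.findall(sample))
```

Neither pattern treats J.-C. as a single token, which shows that whatever rule you pick has to be checked against the kinds of text you actually have.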

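Here is a straightforward solution, which can be summed up as:

break the text up into "words" and "non-words" (punctuation, spaces)
validate the "words" against a dictionary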

import re

# Adjust this to your locale
WORD = re.compile(r'\w+')

text = "foo bar, baz"

while True:
    m = WORD.search(text)
    if not m:
        if text:
            print(f"punctuation: {text!r}")
        break
    start, end = m.span()
    punctuation = text[:start]
    word = text[start:end]
    text = text[end:]
    if punctuation:
        print(f"punctuation: {punctuation!r}")
    print(f"possible word: {word!r}")

possible word: 'foo'
punctuation: ' '
possible word: 'bar'
punctuation: ', '
possible word: 'baz'
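
The loop above only separates words from non-words. The second step, validating the "words" against a dictionary, could be sketched as follows. The function name filter_unknown, the small in-memory dictionary set and the sample sentence are assumptions made for illustration; in the question the dictionary would be loaded from Fr-dictionary.txt. Keeping all-digit tokens without a lookup is likewise an assumption about what the question wants.

```python
import re

# Capturing group: re.split keeps the separators (non-word runs) in the result.
NON_WORD = re.compile(r"(\W+)")

def filter_unknown(text, dictionary):
    """Return (cleaned_text, unknown_words), keeping punctuation and digits."""
    kept, unknown = [], []
    for token in NON_WORD.split(text):
        if not token:
            continue  # re.split yields '' when the text starts or ends with a separator
        if NON_WORD.fullmatch(token) or token.isdigit() or token.lower() in dictionary:
            kept.append(token)
        else:
            unknown.append(token)
    return "".join(kept), unknown

dictionary = {"le", "président", "de", "la", "en"}  # stand-in for the real dictionary
cleaned, unknown = filter_unknown("Le président de la répu blique, en 2018.", dictionary)
print(repr(cleaned))
print(unknown)
```

Because the separators are kept verbatim, removing a word leaves the whitespace around it in place; collapse repeated spaces afterwards if that matters for your output.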

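I have a feeling that you are trying to deal with intentionally misspelled or broken-up words, e.g. when someone is trying to get around forum blacklist rules or speech analysis.

In that case, a better approach would be:

use a dictionary to identify what is a "word" or a "non-word"
then break up the text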
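
One way to read the dictionary-first idea is to test whether two neighbouring fragments form a known word when joined, before treating either of them as unknown. The sketch below is an assumption about how that could look, not the answerer's implementation; merge_fragments and the sample dictionary are hypothetical, and only adjacent pairs of fragments are considered.

```python
def merge_fragments(tokens, dictionary):
    """Join adjacent tokens when neither is a known word but their concatenation is."""
    merged, i = [], 0
    while i < len(tokens):
        current = tokens[i]
        if i + 1 < len(tokens):
            candidate = current + tokens[i + 1]
            if (current not in dictionary
                    and tokens[i + 1] not in dictionary
                    and candidate in dictionary):
                merged.append(candidate)
                i += 2  # both fragments were consumed
                continue
        merged.append(current)
        i += 1
    return merged

dictionary = {"la", "république", "dirigeants", "les"}  # illustrative only
print(merge_fragments(["la", "répu", "blique", "les", "diri", "geants"], dictionary))
```

Words split into three or more pieces would need a longer look-ahead, which this sketch deliberately leaves out.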

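If the original text was written to evade computers while remaining readable by humans, your best bet would be ML/AI, most likely a neural network such as an RNN, in the same way that neural networks are used to identify objects in images.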

Remove only the unknown words from a text, but keep punctuation and digits
