Question

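I have a paragraph that I want to tokenize by separating punctuation from words, and then print the result. There are some special cases (abbreviations like U.S.A., apostrophes like Peter's, and decimal numbers) that should stay attached to their letters or digits and not be split apart.

I run the following code: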

import re

text = "My weight is about 68 kg, +/- 10 grams! I live in U.S.A. at Mr. Peter's house!  3,500 calorie rule, which equates a weight alteration of 2.2 lb"

pattern = r"""(?:[A-Z]\.)+ |\d+(?:\.\d+)?%?|\w/.+$\s-|\w+(?:[-']\w+)*|(?:[+/\-@&*]|/.$/)"""

print(re.findall(pattern, text))

Output:

['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 
 'grams', 'I', 'live', 'in', 'U.S.A. ', 'at', 'Mr', "Peter's", 'house',
 '3', '500', 'calorie', 'rule', 'which', 'equates', 'a', 'weight',
 'alteration', 'of', '2.2', 'lb'
]

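This code has some problems, and I really need help fixing them:

All punctuation is removed! I want to keep it, but separated from the words.
The pattern ignores numbers that contain a comma (,) and drops the comma. I added \d+(?:\,\d+)?%? to the pattern, but it does not work properly.
The pattern also ignores some abbreviations, such as Mr.

Thank you very much for your help!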

Answer 1

I suggest you avoid trying to do this with regular expressions and instead use a tool designed for the job. The following should handle U.S.A. and Peter's:

from nltk.tokenize import WhitespaceTokenizer, word_tokenize

text = "My weight is about 68 kg, +/- 10 grams! I live in U.S.A. at Mr. Peter's house!  3,500 calorie rule, which equates a weight alteration of 2.2 lb"

# word_tokenize relies on NLTK's Punkt tokenizer data, fetched once via nltk.download()
print(WhitespaceTokenizer().tokenize(text))
print(word_tokenize(text))

This gives you the following two possibilities:

['My', 'weight', 'is', 'about', '68', 'kg,', '+/-', '10', 'grams!', 'I', 'live', 'in', 'U.S.A.', 'at', 'Mr.', "Peter's", 'house!', '3,500', 'calorie', 'rule,', 'which', 'equates', 'a', 'weight', 'alteration', 'of', '2.2', 'lb']
['My', 'weight', 'is', 'about', '68', 'kg', ',', '+/-', '10', 'grams', '!', 'I', 'live', 'in', 'U.S.A.', 'at', 'Mr.', 'Peter', "'s", 'house', '!', '3,500', 'calorie', 'rule', ',', 'which', 'equates', 'a', 'weight', 'alteration', 'of', '2.2', 'lb']
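
Note that word_tokenize splits Peter's into 'Peter' and "'s", while the question asks for it to stay whole. One possible way to reconcile the two is a small post-processing pass that glues such pieces back onto the preceding word. The following is only a sketch: rejoin_clitics is a hypothetical helper, not part of NLTK, and the input is a slice of the output listed above, copied in as a literal so the snippet runs without NLTK installed.

```python
# Slice of the word_tokenize() output shown above, copied as a literal
# so that this sketch does not depend on NLTK being installed.
tokens = ['I', 'live', 'in', 'U.S.A.', 'at', 'Mr.', 'Peter', "'s", 'house', '!']


def rejoin_clitics(tokens):
    """Glue pieces such as "'s" or "n't" back onto the preceding token.

    Hypothetical helper: it only looks at the shape of each token, so an
    opening single quote followed by a word would be glued as well.
    """
    merged = []
    for token in tokens:
        is_apostrophe_piece = token.startswith("'") and token[1:].isalpha()
        is_negation = token.lower() == "n't"
        if merged and (is_apostrophe_piece or is_negation):
            merged[-1] += token  # attach to the previous word
        else:
            merged.append(token)
    return merged


print(rejoin_clitics(tokens))
```

Whether this is preferable to WhitespaceTokenizer plus punctuation stripping depends on which of the two outputs is closer to what you need.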

Answer 2

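If you are not going to use a full natural language processing tool, I suggest using a simpler pattern and planning for some post-parse cleanup. Trying to solve everything in the pattern match is tricky and will likely keep failing as new syntax elements are introduced. That said, here is a simpler pattern approach that I believe handles most of the exceptions you are concerned about: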

import re

text = "My weight is about 68 kg, +/- 10 grams! I live in U.S.A. at Mr. Peter's house!  3,500 calorie rule, which equates a weight alteration of 2.2 lb"

pattern = r"(\s+|(?:[A-Z']\.?)+)"

tokens = [token for token in re.split(pattern, text, flags=re.I) if token and not token.isspace()]

print(tokens)

Output:

['My', 'weight', 'is', 'about', '68', 'kg', ',', '+/-', '10', 'grams',
'!', 'I', 'live', 'in', 'U.S.A.', 'at', 'Mr.', "Peter's", 'house', '!',
'3,500', 'calorie', 'rule', ',', 'which', 'equates', 'a', 'weight',
'alteration', 'of', '2.2', 'lb']

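Rather than re.findall(), I am using re.split() with a capturing group, so the matched text is retained in the result, to isolate the tokens in the string (i.e. we split on the words). When a new exception comes up, evaluate whether it is worth complicating the pattern or whether it can be handled before or after parsing.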
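
To illustrate both points, here is a minimal sketch. It first compares the pattern above with and without its outer capturing group, then shows one possible post-parse cleanup for an exception the pattern does not cover: a sentence-final period stays glued to the word before it (house. comes back as a single token). The names tokenize, split_trailing_period, ABBREVIATIONS and INITIALISM are made up for this example, and the abbreviation list is an assumption you would need to extend.

```python
import re

# Pattern from the answer above. The outer capturing group makes re.split()
# return the matched separators (whitespace and words) along with the text
# found between them.
CAPTURING = r"(\s+|(?:[A-Z']\.?)+)"
# Same alternatives without the outer group: matched words are discarded.
NON_CAPTURING = r"\s+|(?:[A-Z']\.?)+"

# Hypothetical allow-list of abbreviations that keep their trailing period.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}
# Dotted initialisms such as U.S.A. keep their periods as well.
INITIALISM = re.compile(r"(?:[A-Za-z]\.)+")


def tokenize(text, pattern=CAPTURING):
    """Split text and drop the empty and whitespace-only pieces."""
    pieces = re.split(pattern, text, flags=re.I)
    return [piece for piece in pieces if piece and not piece.isspace()]


def split_trailing_period(tokens):
    """Post-parse cleanup: detach a final period from ordinary words."""
    cleaned = []
    for token in tokens:
        keep_whole = token in ABBREVIATIONS or INITIALISM.fullmatch(token)
        ends_word = len(token) > 1 and token.endswith(".") and token[-2].isalnum()
        if ends_word and not keep_whole:
            cleaned.extend([token[:-1], "."])
        else:
            cleaned.append(token)
    return cleaned


print(tokenize("68 kg, ok"))                 # ['68', 'kg', ',', 'ok']
print(tokenize("68 kg, ok", NON_CAPTURING))  # ['68', ','] (the words are lost)

sentence = "I live in U.S.A. at Mr. Peter's house."
print(tokenize(sentence))                         # 'house.' is one token
print(split_trailing_period(tokenize(sentence)))  # 'house' and '.' separated
```

The cleanup cannot tell whether U.S.A. at the very end of a sentence also carries the sentence-final period; that kind of ambiguity is where a full tokenizer starts to pay off.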

Regex pattern for abbreviations and punctuation
