Python: regular expression not working as expected

Date: 2016-01-31 19:48:19

Tags: python regex nlp nltk tokenize

I'm using the following regular expression. It is supposed to find the string 'U.S.A.', but it only captures 'A.'. Does anyone know what the mistake is?

#INPUT
import re

text = 'That U.S.A. poster-print costs $12.40...'

print(re.findall(r'([A-Z]\.)+', text))

#OUTPUT
['A.']

Expected output:

['U.S.A.']

I'm following section 3.7 of the NLTK book here, which provides a set of regular expressions, but it doesn't work. I've tried it in both Python 2.7 and 3.4.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

nltk.regexp_tokenize() works the same way as re.findall(), and I suspect my Python somehow fails to recognize the regexes as expected. The regex listed above outputs:

[('', '', ''),
 ('A.', '', ''),
 ('', '-print', ''),
 ('', '', ''),
 ('', '', '.40'),
 ('', '', '')]
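The tuples above can be reproduced directly: with capturing groups in the pattern, re.findall() returns one tuple per match, holding only the last text each group captured, while re.finditer() exposes the full match via group(0). A minimal sketch:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)    # set flag to allow verbose regexps
    ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
  | \w+(-\w+)*        # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
  | \.\.\.            # ellipsis
  | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
'''

# findall: with capturing groups present, it returns one tuple per match,
# containing only the last text each group captured (empty string if unused).
print(re.findall(pattern, text))

# finditer: group(0) is always the full matched text.
print([m.group(0) for m in re.finditer(pattern, text)])
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```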

4 answers:

Answer 0 (score: 3):

Possibly, it's related to how regexes were previously compiled with nltk.internals.compile_regexp_to_noncapturing(), which was abolished in NLTK 3.1; see here.

>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> 
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

But it doesn't work with NLTK v3.1:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

With a slight modification of how you define your regex groups, you can get the same pattern to work in NLTK v3.1, using this regex:

pattern = r"""(?x)                   # set flag to allow verbose regexps
              (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
              |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
              |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
              |(?:[+/\-@&*])         # special characters with meanings
            """

In code:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                   # set flag to allow verbose regexps
... (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*])         # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Without NLTK, using Python's re module, we see that the old regex pattern is not supported natively:

>>> pattern1 = r"""(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               |\w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               |[+/\-@&*]        # special characters with meanings
...               |\S\w*            # any sequence of word characters
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x)                   # set flag to allow verbose regexps
...                       (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
...                       |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
...                       |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
...                       |(?:[+/\-@&*])         # special characters with meanings
...                     """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Note: The change in how NLTK's RegexpTokenizer compiles its regexes also makes the example at NLTK's Regular Expression Tokenizer obsolete.

Answer 1 (score: 2):

Drop the trailing +, or move the + inside the group:

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.']              # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.']  # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.']          # with '+' inside the group

Answer 2 (score: 1):

The first part of the text that the regexp matches is "U.S.A." because ([A-Z]\.)+ matches the first group (the part within the parentheses) three times. However, each group can only return one match, so Python picks the group's last match.

If you instead change the regexp so that the "+" is inside the group, the group will only match once and the full match will be returned. For example, (([A-Z]\.)+) or ((?:[A-Z]\.)+).

If you instead want three separate results, just get rid of the "+" sign in the regexp, and it will match one letter and one dot at a time.
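All three variants from this answer can be checked quickly; a minimal sketch:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Outer capturing group wraps the repetition; findall now returns one tuple
# per match: (outer group = full abbreviation, inner group = last piece).
print(re.findall(r'(([A-Z]\.)+)', text))     # [('U.S.A.', 'A.')]

# Non-capturing inner group: findall returns just the full abbreviation.
print(re.findall(r'((?:[A-Z]\.)+)', text))   # ['U.S.A.']

# Without the '+': one letter-dot pair per match.
print(re.findall(r'([A-Z]\.)', text))        # ['U.', 'S.', 'A.']
```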

Answer 3 (score: 1):

The problem is the "capturing group", i.e., the parentheses, which have an unexpected effect on the result of findall(): when a capturing group is used several times within a match, the regex engine loses track and strange things happen. Specifically: the regex correctly matches the entire U.S.A., but findall drops it on the floor and only returns the group's last capture.
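This behavior is easy to observe with re.search: the match object shows the full match, while group 1 retains only the last repetition. A minimal sketch:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'
m = re.search(r'([A-Z]\.)+', text)
print(m.group(0))  # full match: 'U.S.A.'
print(m.group(1))  # last repetition of the capturing group: 'A.'
```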

As this answer says, the re module doesn't support repeated capturing groups, but you could install the alternative regex module, which handles this properly. (However, that won't help you if you want to pass your regex to nltk.tokenize.regexp.)

Anyway, to match U.S.A. properly, use r'(?:[A-Z]\.)+':

>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']

You can apply the same fix to all the repeated patterns in the NLTK regexes, and everything will work correctly. As @alvas suggested, NLTK used to make this substitution behind the scenes, but that feature was recently dropped and replaced with a warning in the tokenizer documentation. The book is evidently out of date; @alvas filed a bug report back in November, but it hasn't been acted on yet...
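If you have several book-style patterns to fix, the rewrite can be mechanized. This is a rough sketch with a made-up helper name (to_noncapturing is not part of any library): it turns plain '(' into '(?:' with a regex substitution, and it deliberately skips '\(' and '(?...'. It does not handle a literal '(' inside a character class, so check the result before relying on it.

```python
import re

def to_noncapturing(pattern):
    # Hypothetical helper: convert unescaped capturing '(' to non-capturing '(?:'.
    # Naive: skips '\(' and '(?...', but a '(' inside a character class
    # like [()] would be rewritten too.
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

old = r'([A-Z]\.)+|\w+(-\w+)*|\$?\d+(\.\d+)?%?'
new = to_noncapturing(old)
print(new)  # (?:[A-Z]\.)+|\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?

text = 'That U.S.A. poster-print costs $12.40...'
print(re.findall(new, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40']
```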