Question

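I want to process some sentences from the PostgreSQL documentation and run some analysis on them. For the tokenization step, I tried the regular expression '[\w-]+(\.[\w-]+)*' proposed by Lotufo et al. in "Modelling the 'Hurried' Bug Report Reading Process to Summarize Bug Reports". Strangely, I can't get the expected result with this regex in Python.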

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 6.4.0 -- An enhanced Interactive Python.
>>> import re
>>> result = re.findall(r'[\w-]+(\.[\w-]+)*', 'Specifies the directory to use for data storage.')
>>> print(result)

I expected to get a list of words:

['Specifies', 'the', 'directory', 'to', 'use', 'for', 'data', 'storage']

But all I get is a list of empty strings:

['', '', '', '', '', '', '', '']

Does anyone know what is wrong with my code? Thanks a lot.

Answer 1

This works the way you expect:

Python 3.7.2 (default, Jan 16 2019, 19:49:22) 
[GCC 8.2.1 20181215 (Red Hat 8.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> split = re.compile(r'(\w+)')
>>> split.findall('Specifies the directory to use for data storage.')
['Specifies', 'the', 'directory', 'to', 'use', 'for', 'data', 'storage']
>>>

Those square brackets in your regex don't look right to me. I suspect they are the cause.
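
One caveat worth checking before adopting `\w+`: it does not treat hyphens or dots as part of a token, so it is not a drop-in replacement for the pattern in the question. The sketch below compares the two on a made-up sample sentence (the sentence and variable names are illustrative assumptions, not taken from the PostgreSQL documentation).

```python
import re

# Hypothetical sample containing a dotted name and a hyphenated word.
sample = 'Edit postgresql.conf on a read-only host.'

# Plain \w+ splits on both the dot and the hyphen.
simple_tokens = re.findall(r'\w+', sample)
print(simple_tokens)
# ['Edit', 'postgresql', 'conf', 'on', 'a', 'read', 'only', 'host']

# The bracketed pattern (with a non-capturing group) keeps them together.
dotted_tokens = re.findall(r'[\w-]+(?:\.[\w-]+)*', sample)
print(dotted_tokens)
# ['Edit', 'postgresql.conf', 'on', 'a', 'read-only', 'host']
```

Which behaviour is preferable depends on whether names such as `postgresql.conf` should count as a single token in the analysis.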

Answer 2

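The expected strings are being matched, but they are not inside a capture group. When a pattern contains a capture group, re.findall returns the text of the group rather than the whole match, and here the only group, (\.[\w-]+)*, matches nothing for these words, so you get empty strings. Use this regex instead: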

r'([\w-]+(?:\.[\w-]+)*)'

Note that I added ?: to the inner parentheses so that they do not capture.
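
The difference can be seen by running both patterns side by side. This is a minimal sketch using only the standard `re` module; the variable names are illustrative, and the `re.finditer` variant is an alternative that was not part of the original answer.

```python
import re

sentence = 'Specifies the directory to use for data storage.'

# Original pattern: its only group is capturing, so findall returns that
# group's text for each match. No word here is followed by ".word", so the
# group never participates and every result is an empty string.
capturing = re.findall(r'[\w-]+(\.[\w-]+)*', sentence)
print(capturing)

# Non-capturing inner group: findall now returns the whole match.
tokens = re.findall(r'[\w-]+(?:\.[\w-]+)*', sentence)
print(tokens)

# Alternative: keep the original pattern and read the whole match
# from each match object via group(0).
tokens_iter = [m.group(0) for m in re.finditer(r'[\w-]+(\.[\w-]+)*', sentence)]
print(tokens_iter)
```

With no groups (or only non-capturing ones) in the pattern, the outer parentheses in the answer's regex are optional; `re.findall` returns the full match either way.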
