Question

I am trying to use a regular expression to find a specific word on a line in a text document. I tried the code below, but it does not work properly.

import re
f1 = open('text.txt', 'r')
for line in f1:
    m = re.search('(.*)(?<=Dog)Food(.*)', line)
    m.group(0)
    print "Found it."
f1.close()

Error:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Microsoft Visual Studio 11.0
ns\Microsoft\Python Tools for Visual Studio\2.0\visualstudi
0, in exec_file
    exec(code_obj, global_variables)
  File "C:\Users\wsdev2\Documents\Visual Studio 2012\Projec
TML Head Script\HTML_Head_Script.py", line 6, in <module>
    m.group(0)
AttributeError: 'NoneType' object has no attribute 'group'

Answer 1

You are getting AttributeError: 'NoneType' object has no attribute 'group' because no match was found.

re.search() returns None when there is no match, so you can do this:

import re
with open('text.txt', 'r') as myfile:
    for line in myfile:
        m = re.search('(.*)(?<=Dog)Food(.*)', line)
        if m is not None:
            m.group(0)
            print "Found it."
            break # Break out of the loop

Edit: I updated my answer to use your code. Also, I used with/as here because it closes the file automatically afterwards (and it looks cool :p).
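
To see the None case in isolation, here is a minimal sketch in Python 3 syntax that runs the same pattern against in-memory strings instead of text.txt. The sample lines and the helper name find_first are assumptions made for illustration; they are not part of the original code.

```python
import re

# Same pattern as in the question, compiled once for reuse.
PATTERN = re.compile(r'(.*)(?<=Dog)Food(.*)')

def find_first(lines):
    """Return (line_number, matched_text) for the first matching line, or None."""
    for line_number, line in enumerate(lines, 1):
        m = PATTERN.search(line)
        if m is not None:  # search() returns None when nothing matches
            return line_number, m.group(0)
    return None

# Hypothetical sample data standing in for the file contents.
sample = ["This Food is good.", "This DogFood is good."]

# "Food" on the first line is not preceded by "Dog", so there is no match object.
print(PATTERN.search(sample[0]))
print(find_first(sample))
```

Checking for None before calling group() is what prevents the AttributeError shown in the traceback.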

Answer 2

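There are several problems with your program:

1. m will be None if the line does not match, which is why your program crashes.
2. Your code only finds the first match in the line, if one exists. You could use the re.finditer() method instead to iterate over all matches.
3. Using .* before and after the word will match that word even when it appears in the middle of another word, such as DogFooding. This is probably not what you want. Instead, you could use the magical \b atom in your pattern, which the re documentation describes as

    \b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character...

   You might want to use the special r'' raw string syntax instead of manually doubling the backslash to escape it.
4. Using (.*) to capture what comes before and after the match makes the regex hard to use, because it consumes the whole line: there will be only one non-overlapping match per line, even if the word occurs multiple times. Instead, use the match.start() and match.end() methods to get the character positions of the match. Python's match objects are documented online.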
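
The word-boundary and raw-string points above can be checked with a short sketch (Python 3 syntax). The sample sentence and variable names are assumptions chosen for illustration only.

```python
import re

# Hypothetical sample text containing both "DogFooding" and "DogFood".
text = "DogFooding is great, but DogFood is better."

# Without \b the pattern also matches inside the longer word "DogFooding".
loose = [m.span() for m in re.finditer(r'DogFood', text)]

# With \b on both sides only the standalone word matches.
strict = [m.span() for m in re.finditer(r'\bDogFood\b', text)]

# A raw string and a manually doubled backslash build the same pattern text.
same_pattern = (r'\bDogFood\b' == '\\bDogFood\\b')

# In a non-raw string, '\b' is a backspace character, so this pattern
# looks for literal backspaces around the word and finds nothing here.
backspace = re.search('\bDogFood\b', text)

print(loose)
print(strict)
print(same_pattern)
print(backspace)
```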

With this in mind, your code becomes:

#!/usr/bin/env python2.7

import re

# Report every whole-word occurrence of DogFood with its line number and span.
with open('text.txt', 'r') as f1:
    for line_number, line in enumerate(f1, 1):
        for m in re.finditer(r'\bDogFood\b', line):
            print "Found", m.group(0), "line", line_number, "at", m.start(), "-", m.end()

When run with this text.txt:

This Food is good.
This DogFood is good.
DogFooding is great.
DogFood DogFood DogFood.

the program prints:

Found DogFood line 2 at 5 - 12
Found DogFood line 4 at 0 - 7
Found DogFood line 4 at 8 - 15
Found DogFood line 4 at 16 - 23
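
For readers on Python 3, the same search can be sketched as a small generator. This is an illustrative variant, not the answer's original code: the function name find_word and the in-memory sample list are assumptions, with the list mirroring the text.txt contents shown above.

```python
import re

# Whole-word pattern, compiled once.
WORD = re.compile(r'\bDogFood\b')

def find_word(lines):
    """Yield (line_number, start, end) for each whole-word match."""
    for line_number, line in enumerate(lines, 1):
        for m in WORD.finditer(line):
            yield line_number, m.start(), m.end()

# Hypothetical in-memory copy of the sample file.
sample = [
    "This Food is good.",
    "This DogFood is good.",
    "DogFooding is great.",
    "DogFood DogFood DogFood.",
]

for line_number, start, end in find_word(sample):
    print("Found DogFood line", line_number, "at", start, "-", end)
```

Because iterating over a file object yields its lines, an open file can be passed to find_word in place of the list.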

Python RE: find a specific word in a text document
