Question

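When using a regular expression, how do I select text only from lines that have no extra text after the text of interest?

For the input text below, I only want to select string1 through string10, skipping any string that has "blah" on the same line.

Input text file: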

[random lines of text]
DATE/USER: 07/01/15   string1
[random lines of text]
DATE/USER: 07/12/15   string2
[random lines of text]
DATE/USER: 07/04/15   string3
[random lines of text]
DATE/USER: 07/12/15   string4
[random lines of text]
DATE/USER: 07/05/15   string5      * blah1 *
[random lines of text]
DATE/USER: 07/02/15   string6
[random lines of text]
DATE/USER: 07/08/15   string7
[random lines of text]
DATE/USER: 07/11/15   string8      * blah2 *
[random lines of text]
DATE/USER: 07/03/15   string9
[random lines of text]
DATE/USER: 07/10/15   string10      * blah3 *
[random lines of text]

My current code:

import re

# Inside a loop over the lines of the input file:
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d).+', line)
if rphfind:
    print(rphfind[0].strip())

Output:

string1
string2
string3
string4
string5      * blah1 *
string6
string7
string8      * blah2 *
string9
string10      * blah3 *

Again, I am just trying to grab the strings and skip the ones that have "blah" on the same line. My output should exclude string5, string8, and string10.

Edit: Apologies. I made some edits to clarify what I am trying to achieve.

Answer 1

Based on your edit, you can simply split each line on whitespace and keep it only if it has exactly three fields:

with open("in.txt") as f:
    for line in f:
        if line.startswith("DATE/USER:"):
            spl = line.split()
            if len(spl) == 3:
                print(spl[2])

Output:

string1
string2
string3
string4
string6
string7
string9
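
To see why the `len(spl) == 3` check works, here is a minimal sketch that wraps the same logic in a function and prints the fields that `split()` produces. The helper name `trailing_token` is hypothetical, and the sketch assumes the value of interest never contains whitespace.

```python
def trailing_token(line):
    """Return the third field of a DATE/USER line, or None if the line has extra fields."""
    if not line.startswith("DATE/USER:"):
        return None
    fields = line.split()  # splits on any run of whitespace, including the newline
    return fields[2] if len(fields) == 3 else None


# A clean line yields exactly three fields; a "blah" line yields more.
print("DATE/USER: 07/01/15   string1".split())
print("DATE/USER: 07/05/15   string5      * blah1 *".split())
```

The first call prints three fields and the second prints six, so the length check alone is enough to tell the two kinds of line apart.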

Using re:

import re

# Group 1 captures the token only when it is the last thing on the line.
r = re.compile(r'^DATE/USER:\s+\d+/\d+/\d+\s+(\w+)$')
with open("in.txt") as f:
    for line in f:
        match = r.search(line)
        if match:
            print(match.group(1))

Output:

string1
string2
string3
string4
string6
string7
string9
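
If the whole file is already in memory, the same idea can be applied in a single pass. The following is only a sketch under a few assumptions: `SAMPLE`, `PATTERN` and `clean_strings` are illustrative names, the sample text is a shortened stand-in for the real file, and `[ \t]` is used instead of `\s` so that no part of the pattern can run across a line break.

```python
import re

# Hypothetical sample that mirrors the shape of the question's input file.
SAMPLE = """\
some unrelated line
DATE/USER: 07/01/15   string1
some unrelated line
DATE/USER: 07/05/15   string5      * blah1 *
DATE/USER: 07/02/15   string6
"""

# With re.MULTILINE, ^ and $ anchor at every line boundary.
# The trailing [ \t]* tolerates spaces or tabs after the token.
PATTERN = re.compile(
    r'^DATE/USER:[ \t]+\d+/\d+/\d+[ \t]+(\w+)[ \t]*$',
    re.MULTILINE,
)


def clean_strings(text):
    """Return the token of each DATE/USER line that has nothing after it."""
    return PATTERN.findall(text)


print(clean_strings(SAMPLE))  # ['string1', 'string6']
```

Compared with the line-by-line loop, this variant also accepts lines that end in stray spaces, which may or may not be desirable for your data.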

Answer 2

re.findall(r'DATE/USER: \d\d/\d\d/\d\d\s+([A-Z])', line)

Answer 3

The '$' in the pattern below will actually exclude any line that has * blah * after the value:

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])$', line)

This matches only A, B, C, D, F, G and I.

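Without the '$', the capture group ([A-Z]) still grabs only a single uppercase letter, but it allows any line to match (printing A through J in your example):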

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])', line)

I am not sure which version you are looking for.
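
For the string1 to string10 values in the edited question, the difference between the anchored and unanchored patterns can be sketched as follows. This is an adaptation rather than the code above: it swaps `[A-Z]` for `\w+` so that multi-character values are captured, and the names `PREFIX`, `anchored` and `unanchored` are illustrative.

```python
import re

lines = [
    "DATE/USER: 07/01/15   string1",
    "DATE/USER: 07/05/15   string5      * blah1 *",
    "DATE/USER: 07/10/15   string10      * blah3 *",
    "DATE/USER: 07/03/15   string9",
]

# The look-behind has a fixed width (each \d is one character),
# which Python's re module requires.
PREFIX = r'(?<=DATE/USER: \d\d/\d\d/\d\d)'
anchored = re.compile(PREFIX + r'\s+(\w+)$')   # token must end the line
unanchored = re.compile(PREFIX + r'\s+(\w+)')  # anything may follow the token

only_clean = [m.group(1) for m in map(anchored.search, lines) if m]
every_line = [m.group(1) for m in map(unanchored.search, lines) if m]

print(only_clean)  # ['string1', 'string9']
print(every_line)  # ['string1', 'string5', 'string10', 'string9']
```

Because '$' also matches just before a trailing newline, the anchored pattern should behave the same way on lines read directly from a file.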

Python: How do I skip lines that have extra characters when using a regular expression?
