Python: How do I skip lines that have extra characters when using a regular expression?

Date: 2015-07-22 20:35:37

Tags: python regex text-processing

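When using a regular expression, how do I select text only from lines that have no extra text after the text I am interested in?

For the input text below, I want to select only the strings among string1 through string10, skipping any string that has "blah" on the same line.

Input text file: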

[random lines of text]
DATE/USER: 07/01/15   string1
[random lines of text]
DATE/USER: 07/12/15   string2
[random lines of text]
DATE/USER: 07/04/15   string3
[random lines of text]
DATE/USER: 07/12/15   string4
[random lines of text]
DATE/USER: 07/05/15   string5      * blah1 *
[random lines of text]
DATE/USER: 07/02/15   string6
[random lines of text]
DATE/USER: 07/08/15   string7
[random lines of text]
DATE/USER: 07/11/15   string8      * blah2 *
[random lines of text]
DATE/USER: 07/03/15   string9
[random lines of text]
DATE/USER: 07/10/15   string10      * blah3 *
[random lines of text]

My current code:

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d).+', line)
if rphfind:
    print(rphfind[0].strip())

Output:

string1
string2
string3
string4
string5      * blah1 *
string6
string7
string8      * blah2 *
string9
string10      * blah3 *

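Again, I am just trying to grab the strings and skip the ones that have "blah" on the same line. My output should exclude string5, string8 and string10.

Edit: Apologies. I made some edits to clarify what I am trying to achieve.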

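3 answers:

Answer 0: (score: 3)

Based on your edit, you can simply split each line on whitespace and keep only the lines that have exactly three fields: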

with open("in.txt") as f:
    for line in f:
        if line.startswith("DATE/USER:"):
            spl = line.split()
            if len(spl) == 3:
                print(spl[2])

Output:

string1
string2
string3
string4
string6
string7
string9

Using re:

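import re

# The inner group (\w+$) must run to the end of the line, so lines with trailing text do not match.
r = re.compile(r'(^DATE/USER:\s+\d+/\d+/\d+\s+(\w+$))')
with open("in.txt") as f:
    for line in f:
        match = r.search(line)
        if match:
            print(match.group(2))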

Output:

string1
string2
string3
string4
string6
string7
string9
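
To compare the two approaches without a file on disk, here is a minimal sketch that runs both on a few of the sample lines from the question. The function names extract_by_split and extract_by_regex and the SAMPLE constant are hypothetical, introduced only for this illustration, and the pattern is a slightly simplified form of the one above with a single capture group.

```python
import re

# A few lines copied from the question's sample input (assumed layout).
SAMPLE = """\
some unrelated text
DATE/USER: 07/01/15   string1
DATE/USER: 07/05/15   string5      * blah1 *
DATE/USER: 07/02/15   string6
DATE/USER: 07/10/15   string10      * blah3 *
"""

# Anchored at both ends: nothing may follow the captured word.
PATTERN = re.compile(r'^DATE/USER:\s+\d+/\d+/\d+\s+(\w+)$')


def extract_by_split(lines):
    """Keep the third field of DATE/USER lines that have exactly three fields."""
    result = []
    for line in lines:
        if line.startswith("DATE/USER:"):
            parts = line.split()
            if len(parts) == 3:
                result.append(parts[2])
    return result


def extract_by_regex(lines):
    """Keep the captured word of lines that match PATTERN in full."""
    result = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            result.append(match.group(1))
    return result


lines = SAMPLE.splitlines()
print(extract_by_split(lines))
print(extract_by_regex(lines))
```

Both calls return the same list for this sample. The split version is looser: it does not check that the middle field looks like a date, so the regex version is the safer choice if other DATE/USER lines with three fields can occur.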

Answer 1: (score: 2)

re.findall(r'DATE/USER: \d\d/\d\d/\d\d\s+([A-Z])', line)

Answer 2: (score: 1)

The '$' below will actually exclude any line that has * blah * after the value:

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])$', line)

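This will match only A, B, C, D, F, G, and I.

Without the '$', the capture group ([A-Z]) still grabs only a single uppercase letter, but any line is allowed to match (in your example it prints A through J):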

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])', line)

I am not sure which version you are looking for.
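
Both patterns in this answer capture a single uppercase letter, so neither matches the stringN values in the question's current sample input; they appear to have been written against an earlier version of it. Below is a sketch of how the '$' idea might be adapted, assuming the value is a run of word characters. The (\w+)\s*$ tail is an adaptation for illustration and is not part of the original answer.

```python
import re

# Hypothetical adaptation: (\w+) replaces ([A-Z]), and \s*$ requires that
# only optional whitespace follows the value before the end of the line.
PATTERN = re.compile(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+(\w+)\s*$')

lines = [
    "DATE/USER: 07/01/15   string1",
    "DATE/USER: 07/05/15   string5      * blah1 *",
    "DATE/USER: 07/03/15   string9",
    "DATE/USER: 07/10/15   string10      * blah3 *",
]

kept = []
for line in lines:
    # With one capture group, findall returns a list of the captured strings.
    rphfind = PATTERN.findall(line)
    if rphfind:
        kept.append(rphfind[0])

print(kept)
```

Lines with trailing * blah * text produce an empty list from findall and are skipped, because '*' is not a word character and cannot be consumed before the '$'.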