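When using a regular expression, how do I select text only from lines that have no extra text after the text I'm interested in?
For the input text below, I want to select only from string1 through string10, skipping any string that has "blah" on the same line.
Input text file: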
[random lines of text]
DATE/USER: 07/01/15 string1
[random lines of text]
DATE/USER: 07/12/15 string2
[random lines of text]
DATE/USER: 07/04/15 string3
[random lines of text]
DATE/USER: 07/12/15 string4
[random lines of text]
DATE/USER: 07/05/15 string5 * blah1 *
[random lines of text]
DATE/USER: 07/02/15 string6
[random lines of text]
DATE/USER: 07/08/15 string7
[random lines of text]
DATE/USER: 07/11/15 string8 * blah2 *
[random lines of text]
DATE/USER: 07/03/15 string9
[random lines of text]
DATE/USER: 07/10/15 string10 * blah3 *
[random lines of text]
My current code:
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d).+', line)
if rphfind:
    print(rphfind[0].strip())
Output:
string1
string2
string3
string4
string5 * blah1 *
string6
string7
string8 * blah2 *
string9
string10 * blah3 *
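Again, I'm just trying to grab the strings and skip those that have "blah" on the same line. My output should exclude string5, string8 and string10.
Edit: Apologies. I made some edits to refine what I'm trying to achieve.
Answer 0 (score: 3)
Based on your edit, you can definitely just split: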
with open("in.txt") as f:
    for line in f:
        if line.startswith("DATE/USER:"):
            spl = line.split()
            if len(spl) == 3:
                print(spl[2])
Output:
string1
string2
string3
string4
string6
string7
string9
Using re:
import re

# Group 2 captures the token; $ rejects lines with anything after it.
r = re.compile(r'(^DATE/USER:\s+\d+/\d+/\d+\s+(\w+$))')
with open("in.txt") as f:
    for line in f:
        match = r.search(line)
        if match:
            print(match.group(2))
Output:
string1
string2
string3
string4
string6
string7
string9
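If the whole file is read into memory rather than processed line by line, a similar anchored pattern can be applied in one pass with re.MULTILINE, so that ^ and $ anchor at every line. This is only a minimal sketch: SAMPLE is a shortened stand-in for the question's input, and the names PATTERN and extract_clean are made up for illustration. It uses [ \t] instead of \s so that a match cannot spill across a newline.

```python
import re

# Hypothetical sample, shortened from the question's input.
SAMPLE = """\
some random text
DATE/USER: 07/01/15 string1
DATE/USER: 07/05/15 string5 * blah1 *
more random text
DATE/USER: 07/02/15 string6
"""

# With re.MULTILINE, ^ and $ match at the start and end of each line,
# so a line with text after the token fails to match.
# [ \t] (not \s) keeps every match on a single line.
PATTERN = re.compile(
    r'^DATE/USER:[ \t]+\d+/\d+/\d+[ \t]+(\w+)[ \t]*$',
    re.MULTILINE,
)

def extract_clean(text):
    """Return tokens from DATE/USER lines that carry no trailing text."""
    return PATTERN.findall(text)

print(extract_clean(SAMPLE))
```

Because the pattern has a single capture group, findall returns a flat list of the captured tokens.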
Answer 1 (score: 2)
re.findall(r'DATE/USER: \d\d/\d\d/\d\d\s+([A-Z])', line)
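Answer 2 (score: 1)
The '$' below will actually exclude any line that has * blah * after it:
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])$', line)
This will match only A, B, C, D, F, G, I.
The capture group ([A-Z]) grabs only a single uppercase letter, but without the '$' it still lets any line match (printing A through J in your example):
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])', line)
Not sure which version you're looking for.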
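The sample input in the question now uses tokens such as string1 rather than single capital letters, so [A-Z] would not capture them. A hedged adaptation of the '$' version, assuming each token is a run of non-whitespace characters (the names PATTERN and token_or_none are illustrative, not from the thread):

```python
import re

# Same fixed-width look-behind as above; \S+ replaces [A-Z] so that
# multi-character tokens are captured. \s*$ tolerates the trailing
# newline but rejects any further text on the line.
PATTERN = re.compile(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+(\S+)\s*$')

def token_or_none(line):
    """Return the token if nothing follows it on the line, else None."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

lines = [
    "DATE/USER: 07/01/15 string1\n",
    "DATE/USER: 07/05/15 string5 * blah1 *\n",
    "some random text\n",
]
for line in lines:
    token = token_or_none(line)
    if token is not None:
        print(token)
```

The look-behind has to stay fixed-width for Python's re module, which is why it spells out \d\d/\d\d/\d\d instead of \d+/\d+/\d+.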