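I want to search a text file and print a line plus the 3 lines that follow it, but only if a keyword is found in that line and a different keyword is found in the following 3 lines.
My code currently prints too much. Once a block has been printed, is there a way to skip ahead to the next part of the text?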
text = """
here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
"""
text2 = open("tmp.txt","w")
text2.write(text)
text2.close()
searchlines = open("tmp.txt").readlines()
data = []
for m, line in enumerate(searchlines):
    line = line.lower()
    if "keyword" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
        for line2 in searchlines[m:m+4]:
            data.append(line2)
print ''.join(data)
The current output is:
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
I want it to print only:
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Answer 0 (score: 1)
So you want to print every block of 4 lines that contains 2 or more keywords?
Anyway, this is what I just came up with. Maybe you can use it:
text = """
here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
""".splitlines()
keywords = ['keyword', 'keyword2']
buffer, kw = [], set()
for line in text:
    if len(buffer) == 0: # first line of a block
        for k in keywords:
            if k in line:
                kw.add(k)
        if kw: # start a block, appending the line only once
            buffer.append(line)
    else: # continuous lines
        buffer.append(line)
        for k in keywords:
            if k in line:
                kw.add(k)
        if len(buffer) > 3:
            if len(kw) >= 2: # just print blocks with enough keywords
                print '\n'.join(buffer)
            buffer, kw = [], set()
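Answer 1 (score: 1)
So, as others have pointed out, your first keyword, keyword, is a substring of the second keyword, keyword2. I therefore implemented this with regexp objects so that you can use the word boundary anchor \b.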
import re
from StringIO import StringIO
text = """
here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
"""
def my_scan(data,search1,search2):
    buffer = []
    for line in data:
        buffer.append(line)
        if len(buffer) > 4:
            buffer.pop(0)
        if len(buffer) == 4: # Valid search block
            # buffer[0] is the candidate line, buffer[1:] the 3 lines that follow it
            if search1.search(buffer[0]) and search2.search("".join(buffer[1:])):
                for item in buffer:
                    yield item
                buffer = []
# First search term
s1 = re.compile(r'\bkeyword\b')
# Second search term
s2 = re.compile(r'\bkeyword2\b')
for row in my_scan(StringIO(text),s1,s2):
    print row.rstrip()
Produces:
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
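To see why the word boundary matters, here is a small sketch. It is written for Python 3 (unlike the Python 2 code above), and the two sample strings are only illustrations. A plain substring test finds keyword inside keyword2, while the \b-anchored pattern does not:

```python
import re

# Sample lines used only for this illustration.
line_a = "print this line keyword 4"
line_b = "print this line since it has a keyword2 3"

# Plain substring test: "keyword" is also found inside "keyword2".
substring_hits = ["keyword" in line_a, "keyword" in line_b]

# Word-boundary test: \b needs a non-word character (or the string edge)
# on each side, so "keyword2" no longer counts as "keyword".
s1 = re.compile(r"\bkeyword\b")
boundary_hits = [bool(s1.search(line_a)), bool(s1.search(line_b))]

print(substring_hits)  # [True, True]
print(boundary_hits)   # [True, False]
```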
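Answer 2 (score: 0)
Your keywords overlap: "keyword" is a substring of "keyword2".
Also, your data implies that you don't want to see line 13, but according to the problem statement it should be printed.
I changed your first keyword from "keyword" to "firstkey", and your code works (except for line 13).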
$ diff /tmp/q /tmp/q2
4c4
< I want to print out this line and the following 3 lines only once keyword 2
---
> I want to print out this line and the following 3 lines only once firstkey 2
6c6
< print this line keyword 4
---
> print this line firstkey 4
11,12c11,12
< I want to print out this line again and the following 3 lines only once keyword 9
< please print this line keyword 10
---
> I want to print out this line again and the following 3 lines only once firstkey 9
> please print this line firstkey 10
30c30
< if "keyword" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
---
> if "firstkey" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
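If renaming the keyword is not an option, one alternative is to compare whole whitespace-separated tokens instead of substrings. This is only a sketch, assuming Python 3; has_word is a hypothetical helper name, not something from the original code:

```python
def has_word(line, word):
    """Return True if `word` occurs in `line` as a whole whitespace-separated token."""
    return word.lower() in line.lower().split()

print(has_word("print this line keyword 4", "keyword"))   # True
print(has_word("it has a keyword2 3", "keyword"))         # False
print(has_word("it has a keyword2 3", "keyword2"))        # True
```

Note that a token with punctuation attached, such as "keyword," with a trailing comma, would not match under this approach.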
Answer 3 (score: 0)
First, you can correct your code like this:
text = """
0//
1// here is some text 1
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
6// I don't want to print this line but I want to start looking for more text starting at this line 6
7// Don't print this line 7
8// Not this line either 8
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12
13// Don't print this line 13
14// Start again searching here 14
15// etc.
"""
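searchlines = map(str.lower,text.splitlines(1))
# splitlines(1) with argument 1 keeps the newlines
# (because of map(str.lower, ...) the printed lines are lowercased)
data,again = [],-1
for m, line in enumerate(searchlines):
    # m>again skips lines that belong to a block already collected
    if "keyword" in line and m>again and "keyword2" in ''.join(searchlines[m+1:m+4]):
        data.extend(searchlines[m:m+4])
        again = m+3 # last line of this block; searching resumes at m+4
print ''.join(data)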
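The skip-ahead idea can also be written with an explicit index, which makes the "resume after the block" step visible. This is a minimal sketch, assuming Python 3 and whole-word matching on whitespace-separated tokens. find_blocks, its parameters, and the sample lines are hypothetical names chosen for illustration:

```python
def find_blocks(lines, first, second, following=3):
    """Return non-overlapping blocks: a line containing `first` plus the next
    `following` lines, kept only if one of those following lines contains `second`."""
    blocks = []
    i = 0
    while i < len(lines):
        tokens = lines[i].lower().split()
        tail = lines[i + 1:i + 1 + following]
        if first in tokens and any(second in t.lower().split() for t in tail):
            blocks.append(lines[i:i + 1 + following])
            i += 1 + following  # resume at the first line after the block
        else:
            i += 1
    return blocks

# Made-up sample data, shorter than the text in the question.
sample = [
    "here is some text 1",
    "start here keyword 2",
    "has keyword2 3",
    "keyword 4",
    "line 5",
    "line 6",
    "again keyword 7",
    "keyword 8",
    "keyword2 9",
    "line 10",
    "line 11",
]
for block in find_blocks(sample, "keyword", "keyword2"):
    print("\n".join(block))
```

With this sample, the loop collects lines 2 to 5 and lines 7 to 10. Lines 4 and 8 contain keyword too, but they are never tested as block starts because the index has already jumped past them.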
Second, a short regex solution is:
text = """
0//
1// here is some text 1
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
6// I don't want to print this line but I want to start looking for more text starting at this line 6
7// Don't print this line 7
8// Not this line either 8
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12
13// Don't print this line 13
14// Start again searching here 14
15// etc.
"""
import re
# Group 1 is the whole 4-line block; groups 2 and 3 are the optional
# keyword2 matches on the 2nd and 3rd lines. The last line must contain
# keyword2 only if neither group 2 nor group 3 matched.
regx = re.compile('(^.*?(?<=[ \t]){0}(?=[ \t]).*\r?\n'
                  '.*?((?<=[ \t]){1}(?=[ \t]))?.*\r?\n'
                  '.*?((?<=[ \t]){1}(?=[ \t]))?.*\r?\n'
                  '.*?(?(2)|(?(3)|{1})).*)'.\
                  format('keyword','keyword2'),re.MULTILINE|re.IGNORECASE)
print '\n'.join(m.group(1) for m in regx.finditer(text))
Result
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12