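This is a tricky problem. I have read a lot of posts about it, but I have not been able to get it to work.
I have a big file. I need to read it line by line, and once I reach a line of the form "Total is: (any decimal number)", take that string and save the number in a variable. If the number is greater than 40.0, I then need to find the fourth line above the Total line (for example, if the Total line is line 39, that line would be line 35). This line has the format "(number).(space)(substring)". Finally, I need to parse out this substring and do further processing on it.
Here is an example of the input file: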
many lines that we don't care about
many lines that we don't care about
...
1. Hi45
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
...
more lines we don't care about
and then more lines and then
again we get
2. How144
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
I have tried many things, including using re.search() to capture what I need from each line I care about.
Here is my code, which I adapted from another Stack Overflow Q&A:
import re
import linecache
number = ""
higher_line = ""
found_line = ""
with open("filename_with_many_lines.txt") as aFile:
    for num, line in enumerate(aFile, 1):
        searchObj = re.search(r'(\bTotal\b)(\s)(\w+)(\:)(\s)(\d+.\d+)', line)
        if searchObj:
            print "this is on line", line
            print "this is the line number:", num
            var1 = searchObj.group(6)
            print var1
            if float(var1) > 40.0:
                number = num
                higher_line = number - 4
                print number
                print higher_line
                found_line = linecache.getline("filename_with_many_lines.txt", higher_line)
                print "found the", found_line
The expected output is:
this is on line Total is: 45.5
this is the line number: 14857
14857
14853
found the 1. Hi45
this is on line Total is: 62.1
this is the line number: 14985
14985
14981
found the 2. How144
Answer 0 (score: 2)
If the line you need is always four lines above the Total is: line, you can keep the previous lines in a bounded deque.
from collections import deque
with open(filename, 'r') as file:
    previous_lines = deque(maxlen=4)
    for line in file:
        if line.startswith('Total is: '):
            try:
                higher_line = previous_lines[-4]
                # store higher_line, do calculations, whatever
                break  # if you only want to do this once; else let it keep going
            except IndexError:
                # we don't have four previous lines yet
                # I've elected to simply skip this total line in that case
                pass
        previous_lines.append(line)
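A bounded deque (one with a maximum length) discards items from the opposite end if adding a new item would push it past its maximum length. Here we append strings to the right side of the deque, so once its length reaches 4, every new string appended on the right causes it to drop one string from the left. Thus, at the start of each pass through the for loop, the deque holds the four lines before the current line, with the oldest of them at the leftmost position (index 0).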
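To see that sliding-window behaviour concretely, here is a minimal sketch in Python 3. The line strings are made up for illustration and stand in for lines read from a file.

```python
from collections import deque

# Hypothetical stand-ins for lines read from a file.
lines = ["line 1", "line 2", "line 3", "line 4", "line 5", "line 6"]

window = deque(maxlen=4)
snapshots = []
for line in lines:
    # Before appending, the window holds up to four lines preceding `line`.
    snapshots.append(list(window))
    window.append(line)

# While "line 6" is being processed, the window holds lines 2 to 5,
# so index 0 (equivalently index -4) is the line four above it.
print(snapshots[-1])  # ['line 2', 'line 3', 'line 4', 'line 5']
print(snapshots[1])   # ['line 1']
```

Until four lines have been seen the window is shorter than 4, which is why indexing it with -4 can raise IndexError early in the file.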
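In fact, the documentation on collections.deque mentions use cases very similar to ours:
"Bounded length deques provide functionality similar to the tail filter in Unix. They are also useful for tracking transactions and other pools of data where only the most recent activity is of interest."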
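Answer 1 (score: 1)
This stores the line that starts with a number and a dot in a variable called prevline. We print prevline only when re.search returns a match object for the Total line and the number is greater than 40.0.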
import re
with open("file") as aFile:
    prevline = ""
    for num, line in enumerate(aFile, 1):
        m = re.match(r'\d+\.\s*.*', line)  # match object for a line which starts with a number and a dot
        if m:
            prevline = m.group()  # remember the most recent such line (without its newline)
        searchObj = re.search(r'(\bTotal\b\s+\w+:\s+(\d+\.\d+))', line)  # search for the string Total plus a word plus a colon and a float number
        if searchObj:  # if there is any
            score = float(searchObj.group(2))  # then the float number is assigned to the variable called score
            if score > 40.0:  # do all the below operations only if the float number we fetched was greater than 40.0
                print "this is on line", searchObj.group(1)
                print "this is the line number:", num
                print num
                print num-4
                print "found the", prevline
            prevline = ""  # reset after every Total line so a stale header is never reused
Output:
this is on line Total is: 45.5
this is the line number: 8
8
4
found the 1. Hi45
this is on line Total is: 62.1
this is the line number: 20
20
16
found the 2. How144
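If the regular expressions are hard to read, it may help to try them on single lines first. The following is a small Python 3 sketch; the helper names parse_header and parse_total are made up here, and the patterns are only modelled on the ones above.

```python
import re

# Patterns modelled on the ones in this answer; the helper names are hypothetical.
HEADER_RE = re.compile(r'(\d+)\.\s*(.*)')
TOTAL_RE = re.compile(r'\bTotal\s+\w+:\s+(\d+\.\d+)')

def parse_header(line):
    """Return (number, substring) for a line like '1. Hi45', else None."""
    m = HEADER_RE.match(line)
    return (int(m.group(1)), m.group(2)) if m else None

def parse_total(line):
    """Return the total as a float for a line like 'Total is: 45.5', else None."""
    m = TOTAL_RE.search(line)
    return float(m.group(1)) if m else None

print(parse_header("1. Hi45\n"))        # (1, 'Hi45')
print(parse_total("Total is: 45.5\n"))  # 45.5
print(parse_total("bla bla\n"))         # None
```

Note that this total pattern requires a decimal point, so a line such as "Total is: 45" would not match. Whether that matters depends on what the real file contains.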
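Answer 2 (score: 1)
I suggested an edit to Blacklight Shining's post that built on its deque solution, but it was rejected with the suggestion that I post it as an answer instead. Below, I show how Blacklight's solution does solve your problem, if you just stare at it for a moment.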
with open(filename, 'r') as file:
    # Clear: we don't care about checking the first 4 lines for totals.
    # Instead, we just store them for later.
    previousLines = []
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    # The earliest we should expect a total is at line 5.
    for lineNum, line in enumerate(file, 5):
        if line.startswith('Total is: '):
            prevLine = previousLines[0]
            high_num = prevLine.split()[1] # A
            score = float(line[len('Total is: '):].strip()) # B
            if score > 40.0:
                # That's all! We've now got everything we need.
                # Display results as shown in example code.
                print "this is the line number : ", lineNum
                print "this is the line ", line.strip('\n')
                print lineNum
                print (lineNum - 4)
                print "found the ", prevLine
        # Critical - remove the oldest line & push the current line onto the list.
        previousLines = previousLines[1:] + [line]
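I did not make use of a deque, but my code accomplishes the same thing. I don't think it is necessarily a better answer than either of the others; I'm posting it to show how the problem you are trying to solve can be tackled with a very simple algorithm and simple tools. (Compare Avinash's clever 17-line solution with my low-tech 18-line solution.)
This simplistic approach won't make you look like a wizard to anyone reading your code, but it also won't accidentally match any intervening lines. If you are dead set on hitting your lines with a regex, you only need to modify lines A and B. The general solution still works.
The point is that an easy way to remember what was 4 lines back is to keep the last four lines in memory.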
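For what it's worth, the same "keep the last four lines" idea can be wrapped in a small generator. This is only a Python 3 sketch under a few assumptions: the function name find_high_totals and its parameters are made up, the Total line is assumed to start with exactly "Total is: ", and the sample text is invented to mirror the question.

```python
from collections import deque
import io

def find_high_totals(lines, threshold=40.0, offset=4):
    """Yield (line_number, total, header_line) for each qualifying Total line."""
    previous = deque(maxlen=offset)  # the last `offset` lines seen so far
    for line_number, line in enumerate(lines, 1):
        if line.startswith("Total is: ") and len(previous) == offset:
            try:
                total = float(line[len("Total is: "):])
            except ValueError:
                total = None  # not a plain number; skip this line
            if total is not None and total > threshold:
                # previous[0] is the line `offset` lines above the current one.
                yield line_number, total, previous[0].rstrip("\n")
        previous.append(line)

# Invented sample that mirrors the layout in the question.
sample = io.StringIO(
    "1. Hi45\n"
    "People: bla bla\n"
    "\n"
    "bla bla\n"
    "Total is: 45.5\n"
    "bla bla\n"
    "2. How144\n"
    "People: bla bla\n"
    "\n"
    "bla bla\n"
    "Total is: 12.0\n"
)
results = list(find_high_totals(sample))
print(results)  # [(5, 45.5, '1. Hi45')]
```

Because a file object is itself an iterable of lines, the same function should work with open(filename) in place of the StringIO. The substring after the number can then be taken with something like header_line.split(None, 1)[1].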