Read a line and store it in a variable, then read another line and go back to the first line (Python 2)

Time: 2015-04-10 01:15:16

Tags: python regex file line python-2.x

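This is a tricky problem. I have read many posts about it, but I have not been able to make it work.

I have a large file. I need to read it line by line, and once I reach a line of the form "Total is: (any decimal number)", take that string and save the number in a variable. If the number is greater than 40.0, I then need to find the fourth line above the Total line (for example, if the Total line is line 39, that line would be line 35). That line has the format "(number).(space)(substring)". Finally, I need to parse out this substring and do further processing on it.

Here is a sample of the input file: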

many lines that we don't care about
many lines that we don't care about
...
1. Hi45
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
...
more lines we don't care about
and then more lines and then
again we get
2. How144
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space

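I have tried many things, including using the re.search() method to capture what I need from each of the lines I care about.

Here is my code, which I adapted from another Stack Overflow Q&A: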

import re
import linecache
number = ""
higher_line = ""
found_line = ""

with open("filename_with_many_lines.txt") as aFile:
    for num, line in enumerate(aFile, 1):
        searchObj = re.search(r'(\bTotal\b)(\s)(\w+)(\:)(\s)(\d+.\d+)', line)
        if searchObj:
            print "this is on line", line
            print "this is the line number:", num
            var1 = searchObj.group(6)
            print var1
            if float(var1) > 40.0:
                number = num
                higher_line = number - 4
                print number
                print higher_line

                found_line = linecache.getline("filename_with_many_lines.txt", higher_line)
                print "found the", found_line

The expected output is:

this is on line Total is: 45.5
this is the line number: 14857
14857
14853
found the 1. Hi145
this is on line Total is: 62.1
this is the line number: 14985
14985
14981
found the 2.How144

3 Answers:

Answer 0 (score: 2)

If the line you need is always four lines above the Total is: line, you can keep the previous lines in a bounded deque.

from collections import deque

with open(filename, 'r') as file:
    previous_lines = deque(maxlen=4)
    for line in file:
        if line.startswith('Total is: '):
            try:
                higher_line = previous_lines[-4]
                # store higher_line, do calculations, whatever
                break  # if you only want to do this once; else let it keep going
            except IndexError:
                # we don't have four previous lines yet
                # I've elected to simply skip this total line in that case
                pass
        previous_lines.append(line)

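A bounded deque (one with a maximum length) discards an item from the opposite end whenever adding a new item would push it past its maximum length. Here we append strings to the right side of the deque, so once the deque reaches a length of 4, every new string appended on the right causes it to drop one string from the left. As a result, at the start of each iteration of the for loop, the deque holds the four lines before the current line, with the oldest of them at the leftmost position (index 0).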
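The discarding behaviour is easy to check on a small example. The following is a minimal sketch that uses integers in place of file lines; the names `window` and `snapshots` are illustrative only and do not appear in the answer's code.

```python
from collections import deque

window = deque(maxlen=4)
snapshots = []
for item in range(1, 7):
    # At this point the deque holds at most the four items seen before `item`.
    snapshots.append((item, list(window)))
    window.append(item)  # once the deque is full, this drops the leftmost (oldest) item

for item, before in snapshots:
    print("%d: %s" % (item, before))
```

Once the deque is full, `window[0]` and `window[-4]` refer to the same element, which is why indexing with `-4` in the loop above picks out the line four positions back.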

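In fact, the documentation on collections.deque mentions a use case very similar to ours:

"Bounded length deques provide functionality similar to the tail filter in Unix. They are also useful for tracking transactions and other pools of data where only the most recent activity is of interest."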

Answer 1 (score: 1)

This stores the line that starts with a number and a dot in a variable called prevline. We print prevline only if re.search returns a match object.

import re
with open("file") as aFile:
    prevline = ""
    for num, line in enumerate(aFile,1):
        m = re.match(r'\d+\.\s*.*', line)                                # match object for a line that starts with a number and a dot
        if m:
            prevline = m.group()                                         # remember the whole matched line, replacing any earlier one so labels from skipped blocks do not pile up

        searchObj = re.search(r'(\bTotal\b\s+\w+:\s+(\d+\.\d+))', line)  # Search for the line which contains the string Total plus a word plus a colon and a float number
        if searchObj:                                                   # if there is any
            score = float(searchObj.group(2))                           # then the float number is assigned to the variable called score
            if score > 40.0:                                            # Do all the below operations only if the float number we fetched was greater than 40.0
                print "this is the line number: ", num
                print "this is the line", searchObj.group(1)
                print num
                print num-4
                print "found the", prevline
                prevline = ""

Output:

this is on line Total is: 45.5
this is the line number:  8
8
4
found the 1. Hi45
this is on line Total is: 62.1
this is the line number:  20
20
16
found the 2. How144

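Answer 2 (score: 1)

I suggested an edit to Blacklight Shining's post, building on its deque solution, but the edit was rejected with the suggestion that I post it as an answer instead. Below, I show how Blacklight's solution does solve your problem, if you just stare at it for a moment.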

with open(filename, 'r') as file:
    # Clear: we don't care about checking the first 4 lines for totals.
    # Instead, we just store them for later.
    previousLines = []
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())

    # The earliest we should expect a total is at line 5.
    for lineNum, line in enumerate(file, 5):
        if line.startswith('Total is: '):
            prevLine = previousLines[0]
            high_num = prevLine.split()[1] # A
            score = float(line[len('Total is: '):].strip()) # B

            if score > 40.0:
                # That's all! We've now got everything we need.
                # Display results as shown in example code.
                print "this is the line number : ", lineNum
                print "this is the line ", line.strip('\n')
                print lineNum
                print (lineNum - 4)
                print "found the ", prevLine

        # Critical - drop the oldest line & append the current line to the list.
        previousLines = previousLines[1:] + [line]

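I did not make use of deque, but my code accomplishes much the same thing. I do not think it is necessarily a better answer than either of the others; I am posting it to illustrate how the problem you are trying to solve can be handled with a very simple algorithm and simple tools. (Compare Avinash's clever 17-line solution with my low-key 18-line solution.)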

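This simplified approach will not make you look like a wizard to anyone reading your code, but it also will not accidentally match any of the intervening lines. If you are dead set on hitting the lines with a regex, you only need to modify lines A and B. The general solution still works.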
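As one possible illustration of swapping regexes into lines A and B, here is a small sketch. The helper names `parse_total` and `parse_label` and both patterns are assumptions based on the sample input in the question ("Total is: 45.5" and "1. Hi45"); they are not part of any answer's code.

```python
import re

# Assumed line formats, taken from the question's sample input:
#   "Total is: 45.5"   and   "1. Hi45"
TOTAL_RE = re.compile(r'^Total is:\s*(\d+(?:\.\d+)?)\s*$')
LABEL_RE = re.compile(r'^(\d+)\.\s+(\S.*?)\s*$')

def parse_total(line):
    """Return the number on a 'Total is:' line as a float, or None if it does not fit."""
    m = TOTAL_RE.match(line)
    return float(m.group(1)) if m else None

def parse_label(line):
    """Return the substring after '<number>. ', or None if the line does not fit."""
    m = LABEL_RE.match(line)
    return m.group(2) if m else None

print(parse_total("Total is: 45.5\n"))
print(parse_label("1. Hi45\n"))
```

With these helpers, line A could become `high_num = parse_label(prevLine)` and line B `score = parse_total(line)`. Both return None for a line that does not fit the assumed format, so the caller would need to check for None before comparing the score against 40.0.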

The point is that a simple way to remember what was four lines back is to keep the last four lines in memory.
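To tie the pieces together, here is a hedged sketch of that idea as a reusable generator. The function name `totals_above`, its parameters, and the sample data are all made up for illustration, and it assumes every total line starts with exactly `Total is: ` followed by a plain number.

```python
from collections import deque

def totals_above(lines, threshold=40.0, back=4, prefix='Total is: '):
    """Yield (line_number, total, line `back` lines earlier) for each total above threshold."""
    recent = deque(maxlen=back)  # the last `back` lines seen so far
    for number, line in enumerate(lines, 1):
        if line.startswith(prefix) and len(recent) == back:
            try:
                total = float(line[len(prefix):])  # float() ignores the trailing newline
            except ValueError:
                total = None  # not a plain number; skip this line
            if total is not None and total > threshold:
                yield number, total, recent[0].rstrip('\n')
        recent.append(line)

# Hypothetical sample shaped like the question's input.
sample = [
    "1. Hi45\n", "People: bla bla\n", "\n", "bla bla\n", "Total is: 45.5\n",
    "bla bla\n",
    "2. How144\n", "People: bla bla\n", "\n", "bla bla\n", "Total is: 12.0\n",
]
results = list(totals_above(sample))
print(results)
```

Because a file object iterates over its lines, the same function should also accept an open file directly, so a large file never has to be loaded into memory all at once.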