How do I read a 6 GB log file in Python without first loading the whole file into memory?

Time: 2015-06-16 11:01:34

Tags: python

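I want to read a large log file (6 GB) in a buffered way, meaning read 100 MB and then sleep for a few seconds. I also want to avoid loading the file's contents into memory; I want to read it the way head -n x does in bash. The file is made up of blocks, each block contains many lines, and the blocks are separated by 3 blank lines, for example: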

[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET /mobile/ HTTP/1.1
host: www.my-host.com:8082
accept: */*
accept-language: en-gb
connection: keep-alive
accept-encoding: gzip, deflate
user-agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12D508
x-sub-imsi: 418876678
x-sub-msisdn: 333123654



[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET / HTTP/1.1
content-type: application/x-www-form-urlencoded
user-agent: Dalvik/1.6.0 (Linux; U; Android 4.4.2; AirPhoneS6 Build/KOT49H)
host: www.my-host.net
connection: Keep-Alive
accept-encoding: gzip
x-sub-imsi: 418252632
x-sub-msisdn: 333367627836



HTTP/1.1 302 Found
Location: http://www.my-host.net/welcome/main.html
Set-Cookie: oam.Flash.RENDERMAP.TOKEN=-jdrkoipfe; Path=/



[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET / HTTP/1.1
content-type: application/x-www-form-urlencoded
user-agent: Dalvik/1.6.0 (Linux; U; Android 4.4.2; AirPhoneS6 Build/KOT49H)
host: www.my-host.net
connection: Keep-Alive
accept-encoding: gzip
x-sub-imsi: 41887237832
x-sub-msisdn: 333878778

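I want to export the user-agent, along with its msisdn and platform version, to CSV files. So I need to generate 2 files, ios.csv and android.csv, each containing unique msisdns. The file format will be: user-agent,version,msisdn Example: Android,4.2.2,333878778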

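So I have to go through the file block by block, check the user-agent line, and then check its msisdn. I tried doing this in bash, but since bash is not that flexible, I decided to do it in Python.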

2 answers:

Answer 0: (score: 0)

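You can use the fileinput library, which gives you an iterator over the lines, so I don't think it will load the whole file into memory unless you do that yourself.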

import fileinput
import time

log_lines = fileinput.input('my_log_file.txt')

for line in log_lines:
    # do your computation on one line at a time
    time.sleep(5)  # note: this pauses after every single line
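
Since the question is about blocks rather than single lines, here is a minimal sketch of how a line iterator like the one above could be grouped into blocks and reduced to unique (platform, version, msisdn) rows. The function names and the two regular expressions are my own assumptions, inferred only from the sample user-agent strings in the question, so real logs may need broader patterns. It only keeps the current block and the set of seen msisdns in memory.

```python
import re

# Hypothetical patterns, based only on the two user-agent samples in the question.
IOS_RE = re.compile(r"iPhone OS (\d+(?:_\d+)*)")
ANDROID_RE = re.compile(r"Android (\d+(?:\.\d+)*)")


def iter_blocks(lines):
    """Yield each block as a list of lines; blank lines separate blocks."""
    block = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def parse_block(block):
    """Return (platform, version, msisdn), or None if a field is missing."""
    platform = version = msisdn = None
    for line in block:
        name, _, value = line.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if name == "user-agent":
            match = IOS_RE.search(value)
            if match:
                platform, version = "iOS", match.group(1).replace("_", ".")
            else:
                match = ANDROID_RE.search(value)
                if match:
                    platform, version = "Android", match.group(1)
        elif name == "x-sub-msisdn":
            msisdn = value
    if platform and msisdn:
        return platform, version, msisdn
    return None


def unique_rows(lines):
    """Yield one row per (platform, msisdn) pair, in order of first appearance."""
    seen = set()
    for block in iter_blocks(lines):
        row = parse_block(block)
        if row and (row[0], row[2]) not in seen:
            seen.add((row[0], row[2]))
            yield row
```

A file object returned by open() is itself a lazy line iterator, so unique_rows(open('my_log_file.txt')) should work the same way as with fileinput. Each row can then be written with csv.writer(...).writerow(row) to ios.csv or android.csv depending on row[0].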

Answer 1: (score: -1)

import time

def readFile(inputFile):
    buff = 100 * 1024 * 1024  # 100 megabytes per read
    with open(inputFile, 'rb') as file_object:
        while True:
            block = file_object.read(buff)
            if not block:  # an empty result means end of file
                break
            doSomeThing(block)  # placeholder for your own processing
            time.sleep(3)  # pause between chunks
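
One thing to watch with fixed-size reads is that a chunk boundary can fall in the middle of a block, or even in the middle of the separator. A rough sketch of one way to handle that is to carry the incomplete tail over to the next chunk. This assumes Unix line endings, so that 3 blank lines between blocks show up as the bytes b"\n\n\n\n"; the names read_chunks and iter_blocks_from_chunks are hypothetical helpers, not part of any library.

```python
import time


def read_chunks(path, chunk_size=100 * 1024 * 1024, pause=0.0):
    """Yield successive chunks of at most chunk_size bytes from path."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # end of file
                break
            yield chunk
            if pause:
                time.sleep(pause)  # optional pause between chunks


def iter_blocks_from_chunks(chunks, separator=b"\n\n\n\n"):
    """Yield complete blocks from an iterable of byte chunks."""
    tail = b""
    for chunk in chunks:
        parts = (tail + chunk).split(separator)
        tail = parts.pop()  # last piece may be an incomplete block
        for part in parts:
            if part.strip():
                yield part
    if tail.strip():
        yield tail  # whatever is left at end of file
```

With this, iter_blocks_from_chunks(read_chunks('my_log_file.txt', pause=3)) would hand you one whole block at a time while only about one chunk is held in memory. If the file uses Windows line endings, the separator argument would need to change accordingly.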


# time python readfile.py