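I want to read a large log file (6 GB) in buffered chunks: read 100 MB, sleep for a few seconds, then continue. I want to avoid loading the whole file into memory and read it the way `head -n x` does in bash. The file is made up of blocks; each block contains many lines, and blocks are separated by 3 blank lines, for example: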
[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET /mobile/ HTTP/1.1
host: www.my-host.com:8082
accept: */*
accept-language: en-gb
connection: keep-alive
accept-encoding: gzip, deflate
user-agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12D508
x-sub-imsi: 418876678
x-sub-msisdn: 333123654
[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET / HTTP/1.1
content-type: application/x-www-form-urlencoded
user-agent: Dalvik/1.6.0 (Linux; U; Android 4.4.2; AirPhoneS6 Build/KOT49H)
host: www.my-host.net
connection: Keep-Alive
accept-encoding: gzip
x-sub-imsi: 418252632
x-sub-msisdn: 333367627836
HTTP/1.1 302 Found
Location: http://www.my-host.net/welcome/main.html
Set-Cookie: oam.Flash.RENDERMAP.TOKEN=-jdrkoipfe; Path=/
[18/05/2015:00:00:00 +0300]%PARSER_ERROR[elapsedTime]
GET / HTTP/1.1
content-type: application/x-www-form-urlencoded
user-agent: Dalvik/1.6.0 (Linux; U; Android 4.4.2; AirPhoneS6 Build/KOT49H)
host: www.my-host.net
connection: Keep-Alive
accept-encoding: gzip
x-sub-imsi: 41887237832
x-sub-msisdn: 333878778
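I want to export each user-agent together with its msisdn and platform version to CSV, so I need to generate 2 files, ios.csv and android.csv, each containing unique msisdn values. The row format is: user-agent,version,msisdn. Example: Android,4.2.2,333878778
So I have to go through the file block by block, check the user-agent line, and then check its msisdn. I tried to do this in bash, but since bash is not that flexible, I decided to do it in Python.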
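Answer 0 (score: 0)
You can use the fileinput library, which provides an iterator over lines, so I don't think it will load the whole file into memory unless you do that yourself.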
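import fileinput
import time

# fileinput.input() returns a lazy line iterator; the file is not read in full
log = fileinput.input('my_log_file.txt')
for line in log:
    # do your computation on `line` here
    # pause every 1,000,000 lines (arbitrary) instead of after every line
    if fileinput.lineno() % 1_000_000 == 0:
        time.sleep(5)
log.close()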
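To connect this to the CSV export the question asks for, here is a minimal sketch of the per-line computation. It assumes things the question does not state outright: that every block starts with a `[`-prefixed timestamp line (as in the sample), that `user-agent` appears before `x-sub-msisdn` within a block, and that the platform version can be pulled out with the two regular expressions shown. The names `parse_user_agent`, `extract_rows`, and `export` are made up for illustration.

```python
import csv
import re

# Assumed patterns, based only on the two user-agent strings in the sample log.
IOS_RE = re.compile(r"iPhone OS (\d+(?:_\d+)*)")
ANDROID_RE = re.compile(r"Android (\d+(?:\.\d+)*)")


def parse_user_agent(value):
    """Return (platform, version) for an iOS/Android user-agent, else None."""
    match = ANDROID_RE.search(value)
    if match:
        return "Android", match.group(1)
    match = IOS_RE.search(value)
    if match:
        return "iOS", match.group(1).replace("_", ".")
    return None


def extract_rows(lines):
    """Yield unique (platform, version, msisdn) tuples from an iterable of lines."""
    seen = set()   # (platform, msisdn) pairs already emitted
    agent = None   # parsed user-agent of the current block, if any
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("["):
            agent = None  # blank separator or timestamp line: a new block begins
            continue
        name, _, value = line.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if name == "user-agent":
            agent = parse_user_agent(value)
        elif name == "x-sub-msisdn" and agent is not None:
            key = (agent[0], value)
            if key not in seen:
                seen.add(key)
                yield agent[0], agent[1], value


def export(lines, ios_path="ios.csv", android_path="android.csv"):
    """Write one CSV per platform with rows platform,version,msisdn."""
    with open(ios_path, "w", newline="") as ios_f, \
         open(android_path, "w", newline="") as android_f:
        writers = {"iOS": csv.writer(ios_f), "Android": csv.writer(android_f)}
        for platform, version, msisdn in extract_rows(lines):
            writers[platform].writerow([platform, version, msisdn])
```

Calling `export(fileinput.input('my_log_file.txt'))` would stream the log one line at a time. Only the set of msisdn values already seen is kept in memory, so memory use grows with the number of unique subscribers rather than with the file size.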
Answer 1 (score: -1)
import time

def readFile(inputFile):
    buff = 100 * 1024 * 1024  # 100 MB per read
    with open(inputFile, 'rb') as file_object:
        while True:
            block = file_object.read(buff)
            if not block:  # an empty result means end of file
                break
            doSomeThing(block)  # placeholder: process the chunk here
            time.sleep(3)  # pause before reading the next chunk
# time python readfile.py
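One caveat with fixed-size binary reads is that a chunk boundary will usually fall in the middle of a line. A possible way around this, sketched below, is to carry the incomplete last line over into the next chunk. The function name `iter_lines_chunked`, the UTF-8 decoding, and the default values are assumptions for illustration, not something the question specifies.

```python
import time


def iter_lines_chunked(path, chunk_size=100 * 1024 * 1024, pause=3.0):
    """Yield decoded lines (without newline), reading chunk_size bytes at a time."""
    tail = b""  # incomplete last line carried over from the previous chunk
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            lines = (tail + chunk).split(b"\n")
            tail = lines.pop()  # possibly a partial line; completed next round
            for line in lines:
                yield line.decode("utf-8", errors="replace")
            if pause:
                time.sleep(pause)  # rest between chunks
    if tail:
        # the file did not end with a newline; emit the final line
        yield tail.decode("utf-8", errors="replace")
```

Because this is a generator, at most one chunk plus one partial line is held in memory at a time, and the yielded lines can be passed to any line-based processing loop.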