I have a log file, "internet.log", that is about 10 GB. When I parse it in Python, I get a MemoryError exception. The log file looks like this:
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: query[A] fd-geoycpi-uno.gycpi.b.yahoodns.net from 192.168.1.33
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:23 dnsmasq[1979]: query[A] armdl.adobe.com from 192.168.1.24
I am currently parsing the log file like this:
def parse():
    Date = []
    IPAddress = []
    DomainsVisited = []
    with open("internet.log", "r") as file:
        content = file.readlines()
        for items in content:
            if 'query[A]' in items:
                getDate(Date, items)
                getIPAddress(IPAddress, items)
                getDomainsVisited(DomainsVisited, items)
    finalResult = [[i, j, k] for i, j, k in zip(Date, IPAddress, DomainsVisited)]
    return display(finalResult)
If I parse a log file of about 10 MB, the output is displayed, but when I parse the 10 GB log file I get the error. How can I solve this? Thanks.
Answer 0 (score: 0)
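You should not use file.readlines(). It reads the entire file into memory at once, which for a 10 GB file will most likely exhaust it. Iterate over the file object instead: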
with open("internet.log", "r") as file:
    for items in file:
        # process one line at a time here
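(Of course, depending on what you do with the data, it can still run out of memory while you iterate over the file, for example if you keep appending every result to a list.)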
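To make that caveat concrete, here is a minimal sketch of processing the file as a stream so that only a small summary is kept in memory. The names iter_queries and count_queries_per_ip are hypothetical, and the field positions are an assumption based on the sample lines in the question (whitespace-separated, domain in the sixth field, client IP in the eighth); the original getDate, getIPAddress and getDomainsVisited helpers are not shown, so this does not reproduce them.

```python
from collections import Counter


def iter_queries(lines):
    """Yield (date, ip, domain) for each 'query[A]' line, one record at a time."""
    for line in lines:
        if 'query[A]' not in line:
            continue
        parts = line.split()
        # Assumed layout: Mon DD HH:MM:SS dnsmasq[pid]: query[A] <domain> from <ip>
        if len(parts) < 8:
            continue  # skip lines that do not match the assumed layout
        yield ' '.join(parts[:3]), parts[7], parts[5]


def count_queries_per_ip(path):
    """Count queries per client IP without storing every record."""
    counts = Counter()
    with open(path, "r") as log:
        for _date, ip, _domain in iter_queries(log):
            counts[ip] += 1
    return counts
```

Because iter_queries is a generator, each record is discarded as soon as it has been counted, so memory use depends on the number of distinct client IPs rather than on the size of the file.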
Answer 1 (score: 0)
You are using readlines, which reads the entire file into memory.
You can write for items in file instead, which reads one line at a time.
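Tidying the code up a little with better variable names and a list comprehension to build the result (this assumes getDate, getIPAddress and getDomainsVisited are changed to return a value instead of appending to a list):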
with open("internet.log") as log:
    finalResults = [[getDate(line), getIPAddress(line), getDomainsVisited(line)]
                    for line in log
                    if 'query[A]' in line]
I would extract building the result into a function:
def parse_log_line(line):
    return [getDate(line),
            getIPAddress(line),
            getDomainsVisited(line)]
Then your code would be:
with open("internet.log") as log:
    finalResults = [parse_log_line(line)
                    for line in log
                    if 'query[A]' in line]
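Note that the list comprehension still keeps every matching record in memory, which may itself be large for a 10 GB log. If the results only need to be saved or inspected later, one option is to write each record out as it is parsed. The sketch below is an illustration under assumptions: export_queries is a hypothetical name, the CSV output format is my choice rather than anything from the question, and parse_log_line here is a stand-in that splits on whitespace according to the sample lines, since the real helpers are not shown.

```python
import csv


def parse_log_line(line):
    # Hypothetical stand-in for the helper above. Assumed layout:
    # Mon DD HH:MM:SS dnsmasq[pid]: query[A] <domain> from <ip>
    parts = line.split()
    return [' '.join(parts[:3]), parts[7], parts[5]]


def export_queries(log_path, csv_path):
    """Stream query[A] records from log_path into csv_path; return the row count."""
    written = 0
    with open(log_path) as log, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["date", "ip", "domain"])
        for line in log:
            if 'query[A]' in line:
                # Each row is written immediately, so no result list builds up.
                writer.writerow(parse_log_line(line))
                written += 1
    return written
```

A call such as export_queries("internet.log", "queries.csv") would then hold only one line of the log in memory at a time.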