I am trying to parse a very large file (> 4 GB) containing WHOIS information.
I only need a subset of the information contained in the file.
The goal is to output a few WHOIS fields of interest in JSON format.
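For example, for the first object in the sample dump below, the output I have in mind looks roughly like this (one record per inetnum object):

    [
        {
            "inetnum": "10.16.151.184 - 10.16.151.191",
            "netname": "NETECONOMY-MG41731 ENTRY 1",
            "descr": "DUMMY FOO ENTRY 1",
            "country": "IT ENTRY 1"
        }
    ]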
#
# The contents of this file are subject to
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#
inetnum: 10.16.151.184 - 10.16.151.191
netname: NETECONOMY-MG41731 ENTRY 1
descr: DUMMY FOO ENTRY 1
country: IT ENTRY 1
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: neteconomy.rete@example.com
mnt-by: INTERB-MNT
changed: unread@xxx.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
% Tags relating to '10.16.151.184 - 10.16.151.191'
% RIPE-USER-RESOURCE
inetnum: 20.16.151.180 - 20.16.151.183
netname: NETECONOMY-MG41731 ENTRY 2
descr: DUMMY FOO ENTRY 2
country: IT ENTRY 2
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: neteconomy.rete@xxx.it
mnt-by: INTERB-MNT
changed: unread@xxx.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
I am parsing the file and extracting the information with the code below. I am sure it is far from optimized and that I could get the same result in a more efficient way.
import json
import re

RIPE_DB = 'ripe.db.inetnum'  # path to the RIPE dump file

def create_json2():
    regex_inetnum = r'inetnum:\s+(?P<inetnum_val>.*)'
    regex_netname = r'netname:\s+(?P<netname_val>.*)'
    regex_country = r'country:\s+(?P<country_val>.*)'
    regex_descr = r'descr:\s+(?P<descr_val>.*)'
    inetnum_list = []
    netname_list = []
    country_list = []
    descr_list = []
    records = []
    with open(RIPE_DB, "r") as f:
        for line in f:
            inetnum = re.search(regex_inetnum, line, re.IGNORECASE)
            netname = re.search(regex_netname, line, re.IGNORECASE)
            country = re.search(regex_country, line, re.IGNORECASE)
            descr = re.search(regex_descr, line, re.IGNORECASE)
            if inetnum is not None:
                inetnum_val = inetnum.group("inetnum_val").strip()
                inetnum_list.append(inetnum_val)
            if netname is not None:
                netname_val = netname.group("netname_val").strip()
                netname_list.append(netname_val)
            if country is not None:
                country_val = country.group("country_val").strip()
                country_list.append(country_val)
            if descr is not None:
                descr_val = descr.group("descr_val").strip()
                descr_list.append(descr_val)
    for i, n, d, c in zip(inetnum_list, netname_list, descr_list, country_list):
        data = {'inetnum': i, 'netname': n.upper(), 'descr': d.upper(), 'country': c.upper()}
        records.append(data)
    print json.dumps(records, indent=4)

create_json2()
When I run it against the full file, it stops after a while with the following error:
$> ./parse.py
Killed
RAM and CPU load are very high while the file is being processed, and the "Killed" message suggests the process was terminated by the kernel's out-of-memory killer. The same code works as expected, without errors, on small files.
Do you have any suggestions for parsing the 4 GB file, and for improving the logic and quality of the code?
Answer 0 (score: 0)
The magic word is "flush": you need to get the data out of Python as soon as possible, preferably in batches.
#!/usr/bin/env python
import shelve

db = shelve.open('ipnum.db')

def split_line(line):
    # Split "key: value" on the first colon only, so values that
    # themselves contain colons (e.g. URLs) stay intact.
    line = line.split(':')
    key = line[0]
    value = ':'.join(line[1:]).strip()
    return key, value

def parse_entry(f):
    # Consume lines until a (near-)blank line, collecting one WHOIS object.
    entry = {}
    for line in f:
        line = line.strip()
        if len(line) < 5:
            break
        key, value = split_line(line)
        if key not in entry:
            entry[key] = value
        else:
            # Repeated keys (e.g. multiple "remarks:" lines) become a list.
            if not isinstance(entry[key], list):
                entry[key] = [entry[key]]
            entry[key].append(value)
    return entry

def parse_file(file_path):
    i = 0
    with open(file_path) as f:
        for line in f:
            if line.startswith('inetnum'):
                inetnum = split_line(line)[1]
                entry = parse_entry(f)
                db[inetnum] = entry
                if i == 250000:
                    print 'done with 250k'
                    db.sync()  # flush the shelve to disk every 250k records
                    i = 0
                i += 1
    db.close()

if __name__ == '__main__':
    parse_file('ripe.db.inetnum')
This script saves the whole dump into a database file called ipnum.db; you can easily change the output target as well as the flush frequency.
The db.sync() call is mostly for show, since bsddb flushes automatically at these data volumes.
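To get back to the original goal of JSON output, here is a minimal sketch (assuming the ipnum.db file produced above; the output path ripe.json is made up) that reads the shelve database back and writes only the fields of interest, one JSON object per line, so the full record list never has to sit in RAM:

import json
import shelve

db = shelve.open('ipnum.db')
with open('ripe.json', 'w') as out:
    for inetnum in db:
        entry = db[inetnum]
        record = {
            'inetnum': inetnum,
            'netname': entry.get('netname', ''),
            'descr': entry.get('descr', ''),
            'country': entry.get('country', ''),
        }
        # One JSON object per line keeps memory usage flat
        # no matter how many records the database holds.
        out.write(json.dumps(record) + '\n')
db.close()

Line-delimited JSON also means you can post-process the result with streaming tools instead of loading one giant array.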