我正在尝试使用超过4 GB的python解析db
文件。
db文件中的示例:
% Tags relating to '217.89.104.48 - 217.89.104.63'
% RIPE-USER-RESOURCE
inetnum: 194.243.227.240 - 194.243.227.255
netname: PRINCESINDUSTRIEALIMENTARI
remarks: INFRA-AW
descr: PRINCES INDUSTRIE ALIMENTARI
descr: Provider Local Registry
descr: BB IBS
country: IT
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: order.manager2@telecomitalia.it
mnt-by: INTERB-MNT
changed: unread@ripe.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
% Tags relating to '194.243.227.240 - 194.243.227.255'
% RIPE-USER-RESOURCE
inetnum: 194.16.216.176 - 194.16.216.183
netname: SE-CARLSTEINS
descr: CARLSTEINS TRAFIK AB
org: ORG-CTA17-RIPE
country: SE
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: mntripe@telia.net
mnt-by: TELIANET-LIR
changed: unread@ripe.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
我想解析以% Tags relating to
我想要提取inetnum
和第一个descr
这是我到目前为止所做的:(已更新)
import re
with open('test.db', "r") as f:
content = f.read()
r = re.compile(r''
'descr:\s+(.*?)\n',
re.IGNORECASE)
res = r.findall(content)
print res
答案 0 :(得分:1)
因为它超过4gb文件你不想使用f.read()
一次读取所有文件但是使用文件对象作为迭代器(当你迭代一个文件时,你会得到另一行的一行)
以下genererator应该完成这项工作
def parse(filename):
current= None
for l in open(filename):
if l.startswith("% Tags relating to"):
if current is not None:
yield current
current = {}
elif l.startswith("inetnum:"):
current["inetnum"] = l.split(":",1)[1].strip()
elif l.startswith("descr") and not "descr" in current:
current["descr"] = l.split(":",1)[1].strip()
if current is not None:
yield current
您可以将其用作以下
for record in parse("test.db"):
print (record)
结果在测试文件中:
{'inetnum': '194.243.227.240 - 194.243.227.255', 'descr': 'PRINCES INDUSTRIE ALIMENTARI'}
{'inetnum': '194.16.216.176 - 194.16.216.183', 'descr': 'CARLSTEINS TRAFIK AB'}
答案 1 :(得分:0)
如果您只想获得第一个 descr:
r = re.compile(r''
'descr:\s+(.*?)\n(?:descr:.*\n)*',
re.IGNORECASE)
如果你想要inetnum和第一个descr:
[ a + b for (a,b) in re.compile(r''
'(?:descr:\s+(.*?)\n(?:descr:.*\n)*)|(?:inetnum:\s+(.*?)\n)',
re.IGNORECASE) ]
我必须承认我没有使用% Tags relating to
,而且我认为所有descr
都是连续的。