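My SQL file is 2.4 GB and contains more than 100,000 INSERT statements. Using mmap, I can extract 1000 opendata datasets in 2 minutes.
Some of the regex matches span multiple INSERT statements! How can I avoid this?
My code: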
# -*- coding: utf-8 -*-
import re
regex = r'''<thing(.*?\r?\n)+?.*?<bbb(.*?\r?\n)+?.*?<three[^\r\n]+opendata(.*?\r?\n)+?.*?</thing.*?>'''
input_string = r'''
INSERT INTO blah...
<thing>
<bbb>
<three>beer</three>
</bbb>
</thing>
INSERT INTO blah...
<thing>
<bbb>
<three>opendata</three>
</bbb>
</thing>
INSERT INTO blah...
<thing>
<bbb>
<three>opendata2</three>
</bbb>
</thing>
'''
items = re.finditer(regex, input_string, re.I)
for item in items:
    print 'Start...'
    print item.group(0)
    print '...end'
    print
Output:
Start...
<thing>
<bbb>
<three>beer</three>
</bbb>
</thing>
INSERT INTO blah...
<thing>
<bbb>
<three>opendata</three>
</bbb>
</thing>
...end
Start...
<thing>
<bbb>
<three>opendata2</three>
</bbb>
</thing>
...end
What I want:
Start...
<thing>
<bbb>
<three>opendata</three>
</bbb>
</thing>
...end
Start...
<thing>
<bbb>
<three>opendata2</three>
</bbb>
</thing>
...end
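One possible way to get the desired output above is to stop the pattern from stepping over another `<thing>` or `</thing>` tag while it searches, using a negative lookahead on every consumed character. This is only a sketch: the names `NO_THING`, `PATTERN` and `find_opendata_things` are made up for illustration, it assumes `<thing>` blocks are never nested, and it has not been timed against the real 2.4 GB file. It uses `print()` calls so that it also runs on Python 3.

```python
import re

# Lazily consume characters, but never step over an opening or
# closing <thing> tag (hypothetical helper name).
NO_THING = r'(?:(?!</?thing\b).)*?'

PATTERN = re.compile(
    r'<thing\b' + NO_THING +
    r'<bbb\b' + NO_THING +
    r'<three[^\r\n]+opendata' + NO_THING +
    r'</thing[^>]*>',
    re.I | re.S,  # re.S lets "." match newlines, replacing (.*?\r?\n)+?
)


def find_opendata_things(text):
    """Return each <thing> block whose <three> line contains 'opendata'."""
    return [m.group(0) for m in PATTERN.finditer(text)]


sample = '''
INSERT INTO blah...
<thing>
<bbb>
<three>beer</three>
</bbb>
</thing>
INSERT INTO blah...
<thing>
<bbb>
<three>opendata</three>
</bbb>
</thing>
INSERT INTO blah...
<thing>
<bbb>
<three>opendata2</three>
</bbb>
</thing>
'''

for block in find_opendata_things(sample):
    print('Start...')
    print(block)
    print('...end')
```

Because the lookahead blocks at the first `</thing>`, the "beer" block can no longer be glued onto the following one. If the pattern is applied directly to an mmap object on Python 3, it would have to be compiled from a bytes pattern instead of a str.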