I have a huge XML file (1.2 GB) containing information on millions of MusicAlbums, each in a simple format like this:
<MusicAlbum>
<MusicType>P</MusicType>
<Title>22 Exitos de Oro [Brentwood]</Title>
<Performer>Chayito Valdéz</Performer>
</MusicAlbum>
...
<MusicAlbum>
<MusicType>A</MusicType>
<Title>Bye Bye</Title>
<Performer>Emma Aster</Performer>
</MusicAlbum>
I can read and load the file in Python, but when I pass it to BeautifulSoup:
html = FID.read()
print "Converting to Soup"
soup = BeautifulSoup(html)
print "Conversion Completed"
I get:
Converting to Soup
Killed
Apparently "Killed" is something printed by BeautifulSoup. One solution is to split the html into chunks, each containing one "MusicAlbum" ... "/MusicAlbum" block, and pass those to BeautifulSoup one at a time, but I just wanted to check whether there is a simpler solution.
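For what it's worth, the chunking approach described above is essentially what incremental XML parsing does for you. A minimal sketch using the standard library's xml.etree.ElementTree.iterparse, assuming the 1.2 GB file is well-formed XML with a single root element wrapping the records ("albums.xml" is a placeholder filename):

import xml.etree.ElementTree as ET

# stream the file instead of loading it: iterparse fires an event as each
# element opens and closes, so only one record is held in memory at a time
context = ET.iterparse("albums.xml", events=("start", "end"))
event, root = next(context)  # the first "start" event yields the root element
for event, elem in context:
    if event == "end" and elem.tag == "MusicAlbum":
        album = {"type": elem.findtext("MusicType"),
                 "title": elem.findtext("Title"),
                 "performer": elem.findtext("Performer")}
        # process `album` here (write to a database, CSV, etc.)
        root.clear()  # drop finished records so memory stays bounded

lxml.etree offers the same iterparse interface and is considerably faster on files this size.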
Answer 0 (score: 2)
Check whether this works for you; it won't be fast, but it shouldn't use more memory than it needs:
# encoding:utf-8
import re
data = """ <MusicAlbum>
<MusicType>P</MusicType>
<Title>22 Exitos de Oro [Brentwood]</Title>
<Performer>Chayito Valdéz</Performer>
</MusicAlbum>
...
<MusicAlbum>
<MusicType>A</MusicType>
<Title>Bye Bye</Title>
<Performer>Emma Aster</Performer>
</MusicAlbum>"""
# one pattern to split out each record, three to pull the fields from a record
MA = re.compile(r'<MusicAlbum>(.*?)</MusicAlbum>', re.DOTALL)
TY = re.compile(r'<MusicType>(.*)</MusicType>')
TI = re.compile(r'<Title>(.*)</Title>')
P = re.compile(r'<Performer>(.*)</Performer>')

albums = []
for album in MA.findall(data):
    albums.append({
        # group(1) is the text captured between the tags;
        # group() would include the tags themselves
        'type': TY.search(album).group(1),
        'title': TI.search(album).group(1),
        'performer': P.search(album).group(1)})
print albums
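As written, the snippet still reads the whole document into data at once. To apply the same regexes to the real 1.2 GB file with bounded memory, the records can be accumulated one at a time instead; a rough sketch, assuming each tag sits on its own line as in the sample above (iter_albums and 'albums.xml' are hypothetical names):

def iter_albums(path):
    # yield the text between <MusicAlbum> and </MusicAlbum>, one record at a time
    record, inside = [], False
    with open(path) as fid:
        for line in fid:
            if '<MusicAlbum>' in line:
                inside, record = True, []
            elif '</MusicAlbum>' in line:
                inside = False
                yield '\n'.join(record)
            elif inside:
                record.append(line.strip())

for album in iter_albums('albums.xml'):
    print {'type': TY.search(album).group(1),
           'title': TI.search(album).group(1),
           'performer': P.search(album).group(1)}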