Beautiful Soup can't handle large files

Time: 2014-02-19 16:25:45

Tags: python beautifulsoup

I have a huge XML file (1.2 GB) with information on millions of MusicAlbums, each in a simple format like this:

    <MusicAlbum>
      <MusicType>P</MusicType>
      <Title>22 Exitos de Oro [Brentwood]</Title>
      <Performer>Chayito Valdéz</Performer>
    </MusicAlbum>
...
    <MusicAlbum>
      <MusicType>A</MusicType>
      <Title>Bye Bye</Title>
      <Performer>Emma Aster</Performer>
    </MusicAlbum>

I can read and load the file in Python, but when I pass it to BeautifulSoup

from bs4 import BeautifulSoup

html = FID.read()  # FID is the already-opened file handle
print "Converting to Soup"
soup = BeautifulSoup(html)
print "Conversion Completed"

I get

Converting to Soup
Killed

The "Killed" message means the operating system terminated the process, presumably because it ran out of memory while building the soup. One solution would be to split the input into chunks, one per <MusicAlbum>...</MusicAlbum> block, and pass each chunk to BeautifulSoup separately (see the sketch below), but I just wanted to check whether there is a simpler solution.
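
A minimal sketch of that chunking approach (an editorial illustration, not part of the original question), assuming the already-opened file handle FID from the snippet above and that the opening and closing tags each sit on their own line, as in the sample:

    from bs4 import BeautifulSoup

    def iter_album_chunks(fid):
        # Accumulate the lines of one <MusicAlbum>...</MusicAlbum> block
        # and yield it as a small string; only one album's worth of text
        # is held in memory at a time.
        chunk = []
        for line in fid:
            if '<MusicAlbum>' in line:
                chunk = [line]
            elif '</MusicAlbum>' in line:
                chunk.append(line)
                yield ''.join(chunk)
                chunk = []
            elif chunk:
                chunk.append(line)

    for fragment in iter_album_chunks(FID):
        soup = BeautifulSoup(fragment)
        # HTML parsers lower-case tag names, hence 'title' rather than 'Title'
        print soup.find('title').text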

1 answer:

Answer 0 (score: 2):

Check whether this works for you. It won't be fast, but it shouldn't use more memory than you need:

# encoding:utf-8
import re

data = """    <MusicAlbum>
      <MusicType>P</MusicType>
      <Title>22 Exitos de Oro [Brentwood]</Title>
      <Performer>Chayito Valdéz</Performer>
    </MusicAlbum>
...
    <MusicAlbum>
      <MusicType>A</MusicType>
      <Title>Bye Bye</Title>
      <Performer>Emma Aster</Performer>
    </MusicAlbum>"""

# Patterns whose group(1) captures just the text between the tags
MA = re.compile(r'<MusicAlbum>(.*?)</MusicAlbum>', re.DOTALL)
TY = re.compile(r'<MusicType>(.*)</MusicType>')
TI = re.compile(r'<Title>(.*)</Title>')
P = re.compile(r'<Performer>(.*)</Performer>')

albums = []
for album in MA.findall(data):
    # use group(1), not group(), so the surrounding tags are stripped
    albums.append({
        'type': TY.search(album).group(1),
        'title': TI.search(album).group(1),
        'performer': P.search(album).group(1)})


print albums
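
A caveat on the design (an editorial note, not from the original answer): as written, this assumes the whole file has already been read into data, which is exactly what failed for the 1.2 GB input. The same compiled patterns can be applied while reading the file incrementally; a sketch, assuming one tag per line as in the sample and a hypothetical file name albums.xml:

    # TY, TI and P are the compiled patterns from the answer above
    chunk = []
    for line in open('albums.xml'):  # 'albums.xml' is a hypothetical name
        chunk.append(line)
        if '</MusicAlbum>' in line:
            album = ''.join(chunk)
            chunk = []
            # Process each album here rather than collecting millions of
            # dicts in a list, in case even the result is too large.
            print {'type': TY.search(album).group(1),
                   'title': TI.search(album).group(1),
                   'performer': P.search(album).group(1)}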