Using BeautifulSoup to split HTML into groups when the groups are all in the same element

Asked: 2010-06-26 16:30:19

Tags: python html parsing beautifulsoup

Here is an example:

<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>

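If each animal were in its own element, I could just iterate over the elements. That would be great. But the website I'm trying to parse puts all of the information in a single element.

What is the best way to split the soup into separate animals, or otherwise extract the attributes along with the animal each one belongs to?

(Feel free to suggest a better title.)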

2 answers:

Answer 0 (score: 2)

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

animals = []      # animal names, in document order
attributes = {}   # animal name -> list of its attribute strings

for p in soup.findAll('p'):
    if p['class'] == 'animal':
        # A new animal paragraph starts a new group.
        animals.append(p.string)
    elif p['class'] == 'attribute':
        # An attribute belongs to the most recently seen animal.
        attributes.setdefault(animals[-1], []).append(p.string)

print(animals)
print(attributes)

This should work.
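For anyone on Python 3, the same loop can be sketched against the `bs4` package (BeautifulSoup 4). That package is an assumption here, since the answer above imports the older `BeautifulSoup` module. In `bs4`, `class` is a multi-valued attribute, so `p['class']` is a list and has to be tested with `in` rather than `==`. The function name `group_attributes` is made up for this illustration.

```python
from bs4 import BeautifulSoup

HTML = """
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
"""


def group_attributes(html):
    """Map each animal name to the attribute paragraphs that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    current = None  # name of the most recently seen animal
    for p in soup.find_all("p"):
        classes = p.get("class", [])  # a list in bs4; empty if no class
        text = p.get_text(strip=True)
        if "animal" in classes:
            current = text
            groups.setdefault(current, [])
        elif "attribute" in classes and current is not None:
            # Attributes that appear before any animal are skipped.
            groups[current].append(text)
    return groups


groups = group_attributes(HTML)
print(groups)
```

Tracking `current` instead of indexing `animals[-1]` means an attribute paragraph that comes before the first animal is ignored rather than raising an `IndexError`.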

Answer 1 (score: 2)

If you don't need to keep a separate list of animal names, you can simplify Jamie's answer like this:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

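attributes = {}   # animal name -> list of its attribute strings

for p in soup.findAll('p'):
    if p['class'] == 'animal':
        # Start a new group keyed by the animal name.
        animal = p.string
        attributes[animal] = []
    elif p['class'] == 'attribute':
        # Assumes an animal paragraph has already been seen.
        attributes[animal].append(p.string)

print(attributes.keys())
print(attributes)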
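A different way to get the same grouping is to start from each animal paragraph and walk its following siblings until the next animal appears. The sketch below assumes the `bs4` package (BeautifulSoup 4) on Python 3 rather than the `BeautifulSoup` import used above, and the helper name `attributes_after` is hypothetical.

```python
from bs4 import BeautifulSoup

HTML = """
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
"""


def attributes_after(animal_tag):
    """Collect attribute texts that follow animal_tag, up to the next animal."""
    found = []
    for sibling in animal_tag.find_next_siblings("p"):
        classes = sibling.get("class", [])
        if "animal" in classes:
            break  # the next animal's group starts here
        if "attribute" in classes:
            found.append(sibling.get_text(strip=True))
    return found


soup = BeautifulSoup(HTML, "html.parser")
# A list of (name, attributes) pairs keeps document order and repeated names.
pairs = [
    (animal.get_text(strip=True), attributes_after(animal))
    for animal in soup.find_all("p", class_="animal")
]
print(pairs)
```

This rescans the siblings once per animal, so the single pass shown in the answers is likely the better choice for long pages. The sibling walk is mostly useful when you already hold one animal tag and only want its attributes.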