Cannot extract href and title with BeautifulSoup

Time: 2016-10-30 00:17:56

Tags: python beautifulsoup

I have some HTML, and I need to extract the title and the href of certain categories inside a given class. The HTML is:

<div class="submenu_img3" >
                <ul class="submenu_list3 visible_false">
                        <li class="">

                <input type="hidden" name="has_subcategories" value="0"/>
                <input type="hidden" name="has_thirdlevel" value="0"/>
                <input type="hidden" name="level" value="0"/>
                <input type="hidden" name="posicion" value="0"/>
                <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_6511">
                    <span class="txt" >
                        Cerdo selecta                       </span>
                </a>
            </li>
                            <li class="">

                <input type="hidden" name="has_subcategories" value="0"/>
                <input type="hidden" name="has_thirdlevel" value="0"/>
                <input type="hidden" name="level" value="2"/>
                <input type="hidden" name="posicion" value="1"/>
                <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_130201">
                    <span class="txt" >
                        Cerdo Blanco                        </span>
                </a>
            </li>
                            <li class="">

                <input type="hidden" name="has_subcategories" value="0"/>
                <input type="hidden" name="has_thirdlevel" value="0"/>
                <input type="hidden" name="level" value="2"/>
                <input type="hidden" name="posicion" value="2"/>
                <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_130202">
                    <span class="txt" >
                        Cerdo de Teruel                     </span>
                </a>
            </li>
                            <li class="">

                <input type="hidden" name="has_subcategories" value="0"/>
                <input type="hidden" name="has_thirdlevel" value="0"/>
                <input type="hidden" name="level" value="2"/>
                <input type="hidden" name="posicion" value="3"/>
                <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_130203">
                    <span class="txt" >
                        Cerdo Ibérico                       </span>
                </a>
            </li>

But with this code I get nothing:

for row in soup.find_all('div', attrs={"class": "submenu_img3"}, href=True):
    print(row.text)
    print(row.a['href'])

Can you help me? Thanks, and sorry for my English!

1 answer:

Answer 0: (score: 2)

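I guess your intent is to get the href and text of every a tag inside the div tags with class submenu_img3. The problem with your find_all call is the href=True argument: it asks BeautifulSoup to find div tags that themselves have an href attribute, and there are none in the HTML.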
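
To make the mismatch concrete, here is a minimal sketch of one way to keep using find_all. The shortened HTML, the example.com URLs and the variable names (no_match, links) are made up for illustration and are not from the question. The original call returns an empty list, while moving href=True onto a nested search for a tags finds the links:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the HTML in the question.
html_doc = """
<div class="submenu_img3">
  <ul>
    <li><a href="https://example.com/a"><span class="txt"> Cerdo selecta </span></a></li>
    <li><a href="https://example.com/b"><span class="txt"> Cerdo Blanco </span></a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Original call: matches only <div> tags that carry an href attribute themselves.
no_match = soup.find_all("div", attrs={"class": "submenu_img3"}, href=True)
print(no_match)  # empty, because the div has no href

# Find the div first, then search inside it for <a> tags that have an href.
links = []
for div in soup.find_all("div", attrs={"class": "submenu_img3"}):
    for a in div.find_all("a", href=True):
        links.append((a.get_text(strip=True), a["href"]))

for text, href in links:
    print("Text:", text)
    print("Href:", href)
```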

I find it easier to use the select call, which accepts CSS selectors. Here is the code to find all a tags inside the div tags with class submenu_img3:

soup = BeautifulSoup(html_doc, 'html.parser')
# 'div.submenu_img3 a' matches every <a> nested inside a div with that class
for row in soup.select('div.submenu_img3 a'):
    print("Text:", row.text.strip())
    print("Href:", row['href'])

Full code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html_doc = """
<div class="submenu_img3" >
    <ul class="submenu_list3 visible_false">
        <li class="">
            <input type="hidden" name="has_subcategories" value="0"/>
            <input type="hidden" name="has_thirdlevel" value="0"/>
            <input type="hidden" name="level" value="0"/>
            <input type="hidden" name="posicion" value="0"/>
            <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_6511">
                <span class="txt" > Cerdo selecta </span>
            </a>
        </li>

        <li class="">
            <input type="hidden" name="has_subcategories" value="0"/>
            <input type="hidden" name="has_thirdlevel" value="0"/>
            <input type="hidden" name="level" value="2"/>
            <input type="hidden" name="posicion" value="1"/>
            <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_130201">
                <span class="txt" > Cerdo Blanco</span>
            </a>
        </li>

        <li class="">
            <input type="hidden" name="has_subcategories" value="0"/>
            <input type="hidden" name="has_thirdlevel" value="0"/>
            <input type="hidden" name="level" value="2"/>
            <input type="hidden" name="posicion" value="2"/>
            <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_130202">
                <span class="txt" > Cerdo de Teruel </span>
            </a>
        </li>

        <li class="">
            <input type="hidden" name="has_subcategories" value="0"/>
            <input type="hidden" name="has_thirdlevel" value="0"/>
            <input type="hidden" name="level" value="2"/>
            <input type="hidden" name="posicion" value="3"/>
            <a href="https://www.alimentacion.alcampo.es/tienda/index.php?cPath=2112_13_1302_130203">
                <span class="txt" > Cerdo Ibérico  </span>
            </a>
        </li>
    </ul>
</div>
"""

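soup = BeautifulSoup(html_doc, 'html.parser')
# Select every <a> nested inside a div with class submenu_img3
for row in soup.select('div.submenu_img3 a'):
    print("Text:", row.text.strip())  # strip the padding around the span text
    print("Href:", row['href'])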

See the w3schools reference on CSS selectors. CSS selectors are very powerful:

http://www.w3schools.com/cssref/css_selectors.asp
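
As a small illustration of what selectors can express, the sketch below tries an attribute selector, a class selector and a child combinator with select. The shortened HTML, the example.com URLs, the extra div with class other, and the variable names are assumptions added for the example, not part of the question:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical HTML modelled on the question's markup.
html_doc = """
<div class="submenu_img3">
  <ul class="submenu_list3 visible_false">
    <li><a href="https://example.com/a"><span class="txt"> Cerdo selecta </span></a></li>
    <li><a href="https://example.com/b"><span class="txt"> Cerdo Blanco </span></a></li>
  </ul>
</div>
<div class="other"><a>no href here</a></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Descendant selector restricted to <a> tags that have an href attribute.
hrefs = [a["href"] for a in soup.select("div.submenu_img3 a[href]")]

# Class selector on the <span> elements that hold the visible titles.
titles = [s.get_text(strip=True) for s in soup.select("ul.submenu_list3 span.txt")]

# Child combinator: only <a> tags that are direct children of an <li>.
direct = soup.select("li > a")

print(hrefs)
print(titles)
print(len(direct))
```

The a[href] form plays the same role as href=True in find_all, but it is applied to the a tags rather than to the enclosing div.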