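Parsing with Beautiful Soup to get nodes at different levels

Time: 2014-05-07 04:26:20

Tags: python html parsing beautifulsoup

I am trying to get the Deli title and then, under that title, the two menu items "Made to Order Deli Core" and "Turkey Chipotle Petite Wrap". I am using Beautiful Soup 4 for this and it is not working. How would I do the same for the Entrée section?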

<html>
<head>
    <title></title>
</head>

<body>
    <table class="dayinner">
        <tr class="lun">
            <td class="mealname" colspan="3">LUNCH</td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Deli</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000010000047598_35356" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047598_35356');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Made to Order Deli Core</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000047933_06835" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047933_06835');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="height:3px;"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Entrée</td>

            <td class="menuitem">
                <div class="menuitem"><input class="chk" id=
                "S1L0000030000044794_08943" onclick="rptlist(this);"
                onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
                <span class="ul" onclick="nf('0000044794_08943');" onmouseout=
                "pcls(this);" onmouseover="ws(this);">Steamed
                Corn</span><img alt="Vegan" class="icon" src=
                "images/g_062.gif"><img alt="Mindful Item" class="icon" src=
                "images/m_051.gif"></div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000040000033087_22244" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000033087_22244');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>
    </table>
</body>
</html>

Alternatively, it would be fine if I could turn it into XML like this:

<counter name="Deli">
    <dish>
        <name>Made to Order Deli Core</name>
    </dish>
    <dish>
        <name>Turkey Chipotle Petite Wrap</name>
    </dish>
</counter>

Thank you very much, I really appreciate you taking the time to help me.

2 answers:

Answer 0 (score: 1)

Actually, I used Beautiful Soup together with ElementTree (for the XML part) to get all the <span> elements:
# -*- coding: UTF-8 -*-

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

html='''<html>
<head>
    <title></title>
</head>

<body>
    <table class="dayinner">
        <tr class="lun">
            <td class="mealname" colspan="3">LUNCH</td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Deli</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000010000047598_35356" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047598_35356');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Made to Order Deli Core</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000047933_06835" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000047933_06835');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Turkey Chipotle Petite Wrap</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="height:3px;"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Entrée</td>

            <td class="menuitem">
                <div class="menuitem"><input class="chk" id=
                "S1L0000030000044794_08943" onclick="rptlist(this);"
                onmouseout="wschk(0);" onmouseover="wschk(1);" type="checkbox">
                <span class="ul" onclick="nf('0000044794_08943');" onmouseout=
                "pcls(this);" onmouseover="ws(this);">Steamed
                Corn</span><img alt="Vegan" class="icon" src=
                "images/g_062.gif"><img alt="Mindful Item" class="icon" src=
                "images/m_051.gif"></div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000040000033087_22244" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox"> <span class="ul" onclick=
                    "nf('0000033087_22244');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Cuban Mojo Roasted Pork Loin</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>
    </table>
</body>
</html> '''

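# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")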

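# Root element of the XML output
counter = ET.Element('counter')
counter.set("name", "Deli")

# Every <span> holds one dish name; collapse line breaks and runs of spaces
for i in soup.find_all('span'):
    dish = ET.SubElement(counter, 'dish')
    name = ET.SubElement(dish, 'name')
    name.text = ' '.join(i.text.split())

# ET.dump() writes to stdout itself and returns None, so it is not wrapped in print
ET.dump(counter)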
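
Note that searching for every <span> returns the dishes of all stations, so the Entrée items end up in the same counter. Below is a minimal sketch of one way to keep track of which station each dish belongs to. It assumes, as in the question's HTML, that a row whose station cell is empty continues the previous station. The function name dishes_by_station and the trimmed sample markup are illustrative and not part of the original post.

```python
from bs4 import BeautifulSoup


def dishes_by_station(html):
    """Return {station name: [dish names]} in document order."""
    soup = BeautifulSoup(html, "html.parser")
    stations = {}
    current = None  # station that the rows being read belong to
    for row in soup.find_all("tr", class_="lun"):
        cell = row.find("td", class_="station")
        if cell is None:
            continue  # meal header or spacer row, no station cell
        label = cell.get_text().strip()  # strip() also drops the &nbsp;
        if label:
            current = label  # a non-empty cell starts a new station
        span = row.find("span", class_="ul")
        if span is not None and current is not None:
            # collapse the line break inside names such as "Steamed\n Corn"
            dish = " ".join(span.get_text().split())
            stations.setdefault(current, []).append(dish)
    return stations


# Trimmed-down version of the question's table (event attributes removed).
sample = """
<table class="dayinner">
<tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
<tr class="lun"><td class="station">&nbsp;Deli</td>
<td class="menuitem"><div class="menuitem"><span class="ul">Made to Order Deli Core</span></div></td></tr>
<tr class="lun"><td class="station">&nbsp;</td>
<td class="menuitem"><div class="menuitem"><span class="ul">Turkey Chipotle Petite Wrap</span></div></td></tr>
<tr class="lun"><td colspan="3" style="height:3px;"></td></tr>
<tr class="lun"><td class="station">&nbsp;Entrée</td>
<td class="menuitem"><div class="menuitem"><span class="ul">Steamed
Corn</span></div></td></tr>
<tr class="lun"><td class="station">&nbsp;</td>
<td class="menuitem"><div class="menuitem"><span class="ul">Cuban Mojo Roasted Pork Loin</span></div></td></tr>
</table>
"""

menu = dishes_by_station(sample)
print(menu["Deli"])
```

Each key of the returned dictionary could then be turned into its own <counter> element, which would cover the Entrée part of the question as well.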

Answer 1 (score: 1)

You can do it like this:

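# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

# `html` is the HTML string from the question
soup = BeautifulSoup(html, "html.parser")
# the first station cell is "&nbsp;Deli"; strip() also removes the non-breaking space
title = soup.find('td', class_='station').text.strip()

spans = soup.find_all('span', class_='ul')

# create the root of the XML file
root = ET.Element("counter")
root.set("name", title)

for item in spans:
    # retrieve the text inside the <td class="station"> of the row holding this <span>
    text = item.find_parent('tr').find('td', class_='station').text.strip()
    if text == u'Entrée':
        # stop at the first row of the next station
        break

    dish = ET.SubElement(root, 'dish')
    name = ET.SubElement(dish, 'name')
    name.text = item.text.strip()

tree = ET.ElementTree(root)
tree.write("filename.xml")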

The resulting XML file contains the following (shown indented here for readability):

<counter name="Deli">
    <dish>
        <name>Made to Order Deli Core</name>
    </dish>
    <dish>
        <name>Turkey Chipotle Petite Wrap</name>
    </dish>
</counter>
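
ElementTree does not add line breaks or indentation on its own when it serializes a tree. Below is a minimal sketch of producing the indented layout, assuming Python 3.9 or later, where xml.etree.ElementTree.indent is available. The helper name build_counter is illustrative and not part of the original answer.

```python
import xml.etree.ElementTree as ET


def build_counter(station, dishes):
    """Build <counter name=station> with one <dish><name>...</name></dish> per dish."""
    counter = ET.Element("counter", name=station)
    for dish_name in dishes:
        dish = ET.SubElement(counter, "dish")
        ET.SubElement(dish, "name").text = dish_name
    return counter


counter = build_counter(
    "Deli", ["Made to Order Deli Core", "Turkey Chipotle Petite Wrap"]
)
ET.indent(counter, space="    ")  # requires Python 3.9+; modifies the tree in place
xml_text = ET.tostring(counter, encoding="unicode")
print(xml_text)
```

On older Python versions the indentation step would have to be done differently, for example by post-processing the output with xml.dom.minidom.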

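When running this under Python 2, it is very important to put the line # -*- coding: utf-8 -*- at the very top of the file to avoid problems with the accented character in 'Entrée'. For details, see "SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'".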