BeautifulSoup table: stop collecting items at the end of a section

Time: 2014-05-13 00:41:27

Tags: python html5 parsing html-parsing beautifulsoup

Hey everyone, I have some HTML that I'm parsing. Here it is:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title></title>
</head>

<body>
    <table class="dayinner">
        <tr class="lun">
            <td class="mealname" colspan="3">LUNCH</td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Deli</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000010000047598_35356" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox" /> <span class="ul" onclick=
                    "nf('0000047598_35356');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Made to Order Deli Core</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000046033_63436" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox" /> <span class="ul" onclick=
                    "nf('0000046033_63436');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Chicken Caesar Wrap</span>
                </div>
            </td>

            <td class="price"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="height:3px;"></td>
        </tr>

        <tr class="lun">
            <td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
        </tr>

        <tr class="lun">
            <td class="station">&nbsp;Dessert</td>
            <td class="station">&nbsp;</td>

            <td class="menuitem">
                <div class="menuitem">
                    <input class="chk" id="S1L0000020000046033_63436" onclick=
                    "rptlist(this);" onmouseout="wschk(0);" onmouseover=
                    "wschk(1);" type="checkbox" /> <span class="ul" onclick=
                    "nf('0000046033_63436');" onmouseout="pcls(this);"
                    onmouseover="ws(this);">Chicken Caesar Wrap</span>
                </div>
            </td>
        </tr>
    </table>
</body>
</html>

Here is my code. I only want the items under the Deli section, and I usually won't know how many there are. Is there a way to do this?

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("upperMenu.html"))

title = soup.find('td', class_='station').text.strip()

spans = soup.find_all('span', class_='ul')[:2]

But this only works when there are exactly two items. How can I make it work when the number of items is unknown?

Thanks in advance.

1 answer:

Answer 0: (score: 0)

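1. You can use the text argument of find_all to find every station cell (a td with class station) whose text contains the substring Deli.
2. Loop over each match, get the row it belongs to, and find the spans with class ul in that row.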

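import re
from bs4 import BeautifulSoup

# `text` is assumed to hold the HTML source from the question
soup = BeautifulSoup(text)

# station cells whose text contains "Deli"
tds_deli = soup.find_all(name='td', attrs={'class':'station'}, text=re.compile('Deli'))

for td in tds_deli:
    # the row that contains this station cell
    tr = td.find_parent('tr')
    if tr is None:
        continue
    spans = tr.find_all('span', {'class':'ul'})
    for span in spans:
        # do something
        print(span.text)
    print('------------one row -------------')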
Sample output in this case:

Made to Order Deli Core
------------one row -------------

I'm not sure I understood the question correctly, but I think my code should help you get started.
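If the goal is to also pick up rows whose station cell is blank (such as the Chicken Caesar Wrap row that follows Deli), one possible approach is to walk the rows in order and remember the most recent non-blank station name. The following is only a sketch under that assumption: the function name items_for_station and the HTML constant are made up for illustration, and the markup is a trimmed copy of the table in the question with the input elements and event attributes removed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table (illustrative stand-in, not the full page).
HTML = """
<table class="dayinner">
  <tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
  <tr class="lun">
    <td class="station">&nbsp;Deli</td>
    <td class="menuitem"><span class="ul">Made to Order Deli Core</span></td>
  </tr>
  <tr class="lun">
    <td class="station">&nbsp;</td>
    <td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td>
  </tr>
  <tr class="lun"><td colspan="3" style="height:3px;"></td></tr>
  <tr class="lun">
    <td class="station">&nbsp;Dessert</td>
    <td class="station">&nbsp;</td>
    <td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td>
  </tr>
</table>
"""


def items_for_station(html, wanted):
    """Return the menu items listed under the station named `wanted`.

    Hypothetical helper: assumes a blank station cell means the row
    belongs to the same station as the row above it.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []
    current = None  # station name that the current row belongs to
    for tr in soup.find_all("tr", class_="lun"):
        station = tr.find("td", class_="station")
        if station is not None:
            # &nbsp; is parsed as U+00A0, so normalise it before stripping
            name = station.get_text().replace("\xa0", " ").strip()
            if name:  # a non-blank station cell starts a new section
                current = name
        if current == wanted:
            items.extend(span.get_text(strip=True)
                         for span in tr.find_all("span", class_="ul"))
    return items


deli_items = items_for_station(HTML, "Deli")
print(deli_items)
```

With the trimmed table above this collects both Deli items and stops when the Dessert row begins, however many rows the Deli section has. If the real page marks section boundaries differently, the blank-cell assumption would need adjusting.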