Question

I have a page with the following structure:

<div id ="a">
    <table>
        <td> 
            <!-- many tables and divs here -->
        </td>
        <td>
            <table></table>
            <table></table>
            <div class="tabber">
                <table></table>
                <table></table>  <!-- TARGET TABLE -->
            </div>
        </td>
    </table>
</div>

That's right: unfortunately, apart from "tabber" there are no IDs or classes on the target or anywhere near it.

I tried to get the div element:

content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)

stats_div = soup.findAll('div', class_ = "tabber")[1] # 1 because there are 4 elements on page with that class and number 2 is the target one

But it doesn't work; it never outputs anything.

I also tried walking the whole tree from the top to reach the target table:

stats_table = soup.find(id='a').findChildren('table')[0].findChildren('td')[1].findChildren('div')[0].findChildren('table')[1]

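But that doesn't work either. Apparently findChildren('td') does not return the direct children of the first table but all of its descendants: more than 100 td elements.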

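How can I get only the direct children of an element?

Is there a cleaner way to traverse such an ugly nested tree? And why can't I select the div by class? That would simplify everything.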

Answer 1

None of the code you've shown seems to reflect what is actually on that page:

There is no div tag with id='a'. In fact, there is not a single tag with that attribute. That is why your last command, stats_table = ..., fails.

There are exactly 3 div tags whose class attribute equals tabber, not 4:

>>> len(soup.find_all('div', class_="tabber"))
3

They aren't empty, either:

>>> len(soup.find_all('div', class_="tabber")[1])
7

None of the div tags with class tabber has only 2 table children, but I assume that is because you cut your own example down a lot.

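If you want to scrape a site like this one, where you cannot simply select tags by a unique id, you have no choice but to fall back on other attributes, such as tag names. Sometimes the position of tags in the DOM relative to one another is also a useful handle.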
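To address the "direct children" part of the question: find_all (and find) accept recursive=False, which restricts the search to the immediate children of a tag instead of all descendants. The following is only a minimal sketch on made-up markup modelled on the simplified structure in the question; the tr wrapper and the id="target" attribute are assumptions added purely so the result can be checked, and the real page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely based on the question's simplified example.
# The <tr> wrapper and id="target" are assumptions added for illustration.
html = """
<div id="a">
  <table>
    <tr>
      <td><table><tr><td>nested</td></tr></table></td>
      <td>
        <table></table>
        <table></table>
        <div class="tabber">
          <table></table>
          <table id="target"></table>
        </div>
      </td>
    </tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# First <tr> below the element with id="a" (the outer table's row).
outer_row = soup.find(id="a").table.tr

# Default behaviour: find_all searches every descendant, nested cells included.
all_cells = outer_row.find_all("td")

# recursive=False limits the search to direct children of outer_row.
direct_cells = outer_row.find_all("td", recursive=False)

# Walk down one level at a time, never looking deeper than the next level.
tabber = direct_cells[1].find("div", class_="tabber", recursive=False)
target = tabber.find_all("table", recursive=False)[1]

print(len(all_cells), len(direct_cells), target.get("id"))
```

In this sketch the default search also picks up the cell of the nested table, while the recursive=False search returns only the two cells of the outer row.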

For your specific problem, the title attribute works very well:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.soccerstats.com/team.asp?league=england&teamid=24'
>>> soup = BeautifulSoup(urllib2.urlopen(url).read(), 'lxml')
>>> all_stats = soup.find('div', id='team-matches-and stats')
>>> left_column, right_column = [x for x in all_stats.table.tr.children if x.name == 'td']
>>> table1, table2 = [x for x in right_column.children if x.name == 'table']  # the two tables at the top right
>>> [x['title'] for x in right_column.find_all('div', class_='tabbertab')]
['Stats', 'Scores', 'Goal times', 'Overall', 'Home', 'Away']

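The last part is the interesting one: all the tables at the bottom right have title attributes, which makes them much easier to select. Moreover, these attributes make the tags unique within the soup, so you can select them directly from the root: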

>>> stats_div = soup.find('div', class_="tabbertab", title="Stats")
>>> len(stats_div.find_all('table', class_="stat"))
3

These 3 items correspond to the "Current streaks", "Scoring", and "Home/Away advantage" sub-sections.
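As for a cleaner way to express this kind of lookup, Beautiful Soup's select and select_one accept CSS selectors, so the class, the title attribute, and the parent/child relationship can be written in one string. This is a sketch on invented markup, not on the real page: the class names tabbertab and stat and the title values are taken from the session above, while the nesting and the inner table are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names and title values come from
# the answer above, the nesting is an assumption for illustration.
html = """
<div class="tabber">
  <div class="tabbertab" title="Stats">
    <table class="stat"><tr><td><table class="inner"></table></td></tr></table>
    <table class="stat"></table>
  </div>
  <div class="tabbertab" title="Scores">
    <table class="stat"></table>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Class and attribute combined in a single selector.
stats_div = soup.select_one('div.tabbertab[title="Stats"]')

# The child combinator ">" matches direct children only.
direct_tables = soup.select('div.tabbertab[title="Stats"] > table')

# A plain space (descendant combinator) matches tables at any depth.
all_tables = soup.select('div.tabbertab[title="Stats"] table')

print(stats_div["title"], len(direct_tables), len(all_tables))
```

The child combinator plays the same role as recursive=False, so the choice between the two styles is mostly a matter of taste.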
