Question

Here is what my HTML looks like:

<head> ... </head>

<body>
    <div>
        <h2>Something really cool here<h2>
        <div class="section mylist">
            <table id="list_1" class="table">
                <thead> ... not important <thead>
                <tr id="blahblah1"> <td> ... </td> </tr> 
                <tr id="blah2"> <td> ... </td> </tr> 
                <tr id="bl3"> <td> ... </td> </tr> 
            </table>
        </div>
    </div>
</body>

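This div now appears four times in my HTML file; each table has different content and each h2 has different text. Everything else is identical. So far I have managed to extract the parent of each h2. However, I don't know how to get each tr inside it, from which I can then extract the td that I really need.

This is the code I have written so far...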

Answer 1

I suggest finding the parent div that actually encloses the table, then searching it for all td tags. Here is how you could do it:

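from bs4 import BeautifulSoup

# The 'lxml' parser is a separate package (pip install lxml)
with open('myhtml.html') as f:
    soup = BeautifulSoup(f, 'lxml')

# find() returns only the first div whose class attribute is exactly "section mylist"
div = soup.find('div', class_='section mylist')
for td in div.find_all('td'):
    print(td.text)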
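
Since the question mentions that the div appears four times, and find() stops at the first match, here is a minimal sketch of how the same idea might be extended to every block. The sample markup, the helper name cells_by_heading, and the choice of the built-in html.parser are assumptions made for illustration; they are not taken from the asker's file.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed sample containing two of the repeated blocks.
SAMPLE = """
<body>
  <div>
    <h2>First heading</h2>
    <div class="section mylist">
      <table id="list_1" class="table">
        <tr id="r1"><td>a</td></tr>
        <tr id="r2"><td>b</td></tr>
      </table>
    </div>
  </div>
  <div>
    <h2>Second heading</h2>
    <div class="section mylist">
      <table id="list_2" class="table">
        <tr id="r3"><td>c</td></tr>
      </table>
    </div>
  </div>
</body>
"""


def cells_by_heading(markup, parser="html.parser"):
    """Map each h2 text to the td texts of the table that follows it."""
    soup = BeautifulSoup(markup, parser)
    result = {}
    # class_="mylist" matches any div that has "mylist" among its classes.
    for section in soup.find_all("div", class_="mylist"):
        # The h2 sits just before the section div in document order.
        heading = section.find_previous("h2")
        title = heading.get_text(strip=True) if heading else None
        result[title] = [td.get_text(strip=True) for td in section.find_all("td")]
    return result


if __name__ == "__main__":
    for title, cells in cells_by_heading(SAMPLE).items():
        print(title, cells)
```

Keying the result by the h2 text keeps each table's cells associated with its heading, which is what distinguishes the four blocks in the question. If two headings could share the same text, a list of (title, cells) pairs would be a safer structure.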

Answer 2

After some searching, I realized the problem was caused by my parser. I installed lxml, and now everything works.
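
Because the outcome here depended on which parser was in use, it may help to name the parser explicitly rather than rely on a default. The following is a rough sketch, assuming that lxml is preferred when installed and that the built-in html.parser is an acceptable fallback; make_soup is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is available."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is installed")


if __name__ == "__main__":
    soup, used = make_soup("<table><tr><td>a</td></tr></table>")
    print(used, [td.get_text() for td in soup.find_all("td")])
```

Different parsers can build different trees from malformed markup, so printing which parser was actually used makes this kind of discrepancy easier to spot.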
