Question

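Recently I started scraping multiple pages, but the page structure is really hard to scrape. It has a lot of "nth-of-type" elements that have no class of their own, while their parents all share the same class. I am using BeautifulSoup, and it was great until I ran into this awful markup...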

<div class="detail-50">
    <div class="detail-panel-wrap">
        <h3>Contact details</h3>
            Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br />Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br />Tel: 11111111 111
                    </div>
                        </div>

It looks fine at first, but I want to scrape the website, email, and telephone number separately. I have tried many approaches, such as

website = soup.select('div.detail-panel-wrap')[1].text

but it did not work. The bigger problem is that other elements have the same classes as the contact details:

<div class="detail-50">
    <div class="detail-panel-wrap">
        <h3>Public address</h3>
            Mr Martin Austin, Some street, Some city, some ZIP
                    </div>
                        </div>

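This one is the address, which I also need to scrape. There are many other 'div' elements with these two class names. Does anyone have a solution? If anything is unclear I can explain it better; sorry for the poor explanation.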

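Update
With the help of some selector software I have worked out what the selectors should be, but writing them in Python is difficult. This is how to find the telephone on the page: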

div#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(1) div.detail-panel-wrap

This one is the address:

div#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(2) div.detail-panel-wrap

This is the website:

div.detail-50 a:nth-of-type(1)

This is the contact email:

div.detail-panel-wrap a:nth-of-type(2)

Note: ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact

is the id of the topmost parent div of all of these.

Does anyone know how to write these with BS4 in Python?

Answer 1

If there are multiple divs with the class detail-panel-wrap, you can use the h3 text to pick out the one you want:

contact = soup.find("h3", text="Contact details").parent
address = soup.find("h3", text="Public address").parent

If we run that on a sample, you can see that we get both divs:

In [22]: html = """
   ....: <div class="detail-50">
   ....:     <div class="detail-panel-wrap">
   ....:         <h3>Contact details</h3>
   ....:             Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br />Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br />Tel: 11111111 111
   ....:                     </div>
   ....:     </div>
   ....:     <div class="detail-50">
   ....:         <div class="detail-panel-wrap">
   ....:             <h3>Public address</h3>
   ....:                  Mr Martin Austin, Some street, Some city, some ZIP
   ....:         </div>
   ....:     </div>
   ....:     <div class="detail-panel-wrap">
   ....:         <h3>foo</h3>
   ....:     </div>
   ....:     <div class="detail-panel-wrap">
   ....:         <h3>bar</h3>
   ....:     </div>
   ....: </div>
   ....:     """

In [23]: from bs4 import BeautifulSoup

In [24]: soup = BeautifulSoup(html,"lxml")

In [25]: contact = soup.find("h3", text="Contact details").parent

In [26]: address = soup.find("h3", text="Public address").parent

In [27]: print(contact)
<div class="detail-panel-wrap">
<h3>Contact details</h3>
            Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br/>Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br/>Tel: 11111111 111
                    </div>

In [28]: print(address)
<div class="detail-panel-wrap">
<h3>Public address</h3>
                 Mr Martin Austin, Some street, Some city, some ZIP
        </div>

There may be other ways, but without seeing the full HTML structure it is impossible to know.
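
Once the two parent divs are found this way, the individual fields can be pulled out of them. The following is only a sketch against the sample markup from the question, not the real page: it assumes the website is the first link that is not a mailto: link, that the email is the mailto: link, and that the telephone is the bare text starting with "Tel:". It uses the string= argument, which is the newer name for text= in Beautiful Soup 4.4.0 and later, and the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (an assumption about the real page).
html = """
<div class="detail-50">
    <div class="detail-panel-wrap">
        <h3>Contact details</h3>
            Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br />Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br />Tel: 11111111 111
    </div>
</div>
<div class="detail-50">
    <div class="detail-panel-wrap">
        <h3>Public address</h3>
            Mr Martin Austin, Some street, Some city, some ZIP
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate each panel through the text of its h3 heading.
contact = soup.find("h3", string="Contact details").parent
address_div = soup.find("h3", string="Public address").parent

# Website: first link in the contact panel that is not a mailto: link.
website = next(
    a["href"]
    for a in contact.find_all("a", href=True)
    if not a["href"].startswith("mailto:")
)

# Email: the visible text of the mailto: link.
email = contact.select_one('a[href^="mailto:"]').get_text(strip=True)

# Telephone: the bare text node that starts with "Tel:".
tel_node = contact.find(string=lambda s: s and s.strip().startswith("Tel:"))
tel = tel_node.strip()[len("Tel:"):].strip()

# Address: the text node that directly follows the h3 heading.
address = address_div.h3.next_sibling.strip()

print(website, email, tel, address, sep="\n")
```

If a panel is missing on some pages, soup.find returns None and the .parent access raises AttributeError, so real code would want to check for that first.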

For your edit, you can simply pass those selectors to select_one:

telephone = soup.select_one("#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(1) div.detail-panel-wrap")

address = soup.select_one("#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(2) div.detail-panel-wrap")


website = soup.select_one("div.detail-50 a:nth-of-type(1)")

email = soup.select_one("div.detail-panel-wrap a:nth-of-type(2)")

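But there is no guarantee that selectors which work in Chrome's developer tools and the like will also work on the source you actually download.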
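
Because select_one returns None when nothing matches, it may be worth guarding each lookup. The sketch below runs the selectors from the question against a small stand-in document; the wrapper div with the long id and the detail-panel class is an assumption about how the real page is nested, and select_text is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

PANEL = "#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact"

# Stand-in markup: the outer wrapper is assumed, the inner divs come from the question.
html = """
<div id="ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact" class="detail-panel">
  <div class="detail-50"><div class="detail-panel-wrap"><h3>Contact details</h3>
    Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br />Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br />Tel: 11111111 111
  </div></div>
  <div class="detail-50"><div class="detail-panel-wrap"><h3>Public address</h3>
    Mr Martin Austin, Some street, Some city, some ZIP
  </div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def select_text(soup, selector):
    """Hypothetical helper: text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(" ", strip=True) if node is not None else None


contact_text = select_text(
    soup, PANEL + ".detail-panel div.detail-50:nth-of-type(1) div.detail-panel-wrap"
)
address_text = select_text(
    soup, PANEL + ".detail-panel div.detail-50:nth-of-type(2) div.detail-panel-wrap"
)
# There is no third div.detail-50 in this sample, so this lookup yields None.
missing = select_text(
    soup, PANEL + ".detail-panel div.detail-50:nth-of-type(3) div.detail-panel-wrap"
)

# The link selectors from the question, guarded the same way.
website_link = soup.select_one("div.detail-50 a:nth-of-type(1)")
email_link = soup.select_one("div.detail-panel-wrap a:nth-of-type(2)")
website = website_link["href"] if website_link is not None else None
email = email_link.get_text(strip=True) if email_link is not None else None

print(contact_text, address_text, missing, website, email, sep="\n")
```

If a lookup comes back as None on the real page, comparing the downloaded HTML with what the browser shows is usually the first thing to check, since content added by JavaScript will not be in the fetched source.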

Python BeautifulSoup: scraping nth-of-type elements
