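BeautifulSoup4 - identifying information by the value of a <strong> tag only works for some values of the tag

Date: 2018-02-27 21:12:48

Tags: python html python-3.x web-scraping beautifulsoup

I am working with the following 'block' of HTML: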

<div class="marketing-directories-results">
    <ul>
        <li>
            <div class="contact-details">
                <h2>
                    A I I Insurance Brokerage of Massachusetts Inc
                </h2>
                <br/>
                <address>
                    183 Davis St
                    <br/>
                    East Douglas
                    <br/>
                    Massachusetts
                    <br/>
                    U S A
                    <br/>
                    MA 01516-113
                </address>
                <p>
                    <a href="http://www.agencyint.com">
                        www.agencyint.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-0">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-0 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Engineering
                        </li>
                        <li>
                            NM General Liability (US direct)
                        </li>
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
        <li>
            <div class="contact-details">
                <h2>
                    ABCO Insurance Underwriters Inc
                </h2>
                <br/>
                <address>
                    ABCO Building, 350 Sevilla Avenue, Suite 201
                    <br/>
                    Coral Gables
                    <br/>
                    Florida
                    <br/>
                    U S A
                    <br/>
                    33134
                </address>
                <p>
                    <a href="http://www.abcoins.com">
                        www.abcoins.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-1">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-1 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
    </ul>
</div>

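I am scraping multiple data points from this HTML. The ones giving me trouble are the "Accepts Business From:" and "Classes of business" values. I can pick up the "Accepts Business From:" value, regardless of the order in which it appears: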

try:
    li_area = company.find('ul', class_='result-info info-cov-' + 
                                  str(company_counter) + ' cc')
    li_stuff = li_area.find_all('li')
    for li in li_stuff:
        if li.strong.text.strip() == 'Accepts Business From:':
            business_final = li.find('li').text.strip()
except AttributeError:
    pass

Note: the "company" variable is the BeautifulSoup object that contains the HTML I pasted above.

Note: the class name changes for each record on the page. I only included a short excerpt in the HTML example to keep things brief.

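When I try the same code block, this time replacing 'Accepts Business From:' with 'Classes of business' in the li.strong.text.strip() == ... comparison, the code does not seem to detect the strong tag and only picks up 'Accepts Business From:'. Is my for loop incorrect, so that it is not actually iterating over each <li> tag that contains these different strong tags? Could the real value of this strong tag be something other than 'Classes of business'? (I did copy the value directly from the site's HTML.)

Any help you can provide is much appreciated.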

1 answer:

Answer 0: (score: 1)

The reason you get the text for 'Accepts Business From:' but not for 'Classes of business' is that your try-except is in the wrong place.
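
To illustrate why the placement matters, here is a minimal sketch in plain Python with no BeautifulSoup involved. The list labels and the two function names are hypothetical stand-ins, not part of the original code: None plays the role of an <li> that has no <strong> tag, so calling .strip() on it raises AttributeError in the same way li.strong.text does.

```python
# Hypothetical stand-ins: None plays the role of a missing <strong> tag.
labels = ['Accepts Business From:', None, 'Classes of business']


def scan_with_outer_try(items):
    """try wraps the whole loop, so the first error ends the loop."""
    seen = []
    try:
        for label in items:
            seen.append(label.strip())  # None.strip() raises AttributeError
    except AttributeError:
        pass
    return seen


def scan_with_inner_try(items):
    """try sits inside the loop, so a failing item is skipped and the loop goes on."""
    seen = []
    for label in items:
        try:
            seen.append(label.strip())
        except AttributeError:
            continue
    return seen


print(scan_with_outer_try(labels))
print(scan_with_inner_try(labels))
```

The first call stops at the None entry and never sees 'Classes of business'; the second skips the None entry and reaches it.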

In the second iteration of the for li in li_stuff: loop, li becomes <li>U.S.A</li>. That element has no <strong> tag, so li.strong is None and accessing li.strong.text raises an AttributeError. With your current try-except, the error is caught outside the for loop and passed, which ends the loop. So the loop never reaches the third iteration, where it would pick up the text for 'Classes of business'.

To continue the loop after the error is caught, use:

for li in li_stuff:
    try:
        if li.strong.text.strip() == 'Accepts Business From:':
            business_final = li.find('li').text.strip()
            print('Accepts Business From:', business_final)
        if li.strong.text.strip() == 'Classes of business':
            business_final = li.find('li').text.strip()
            print('Classes of business:', business_final)
    except AttributeError:
        pass  # or you can use 'continue' too.

Output:

Accepts Business From: U.S.A
Classes of business: Engineering

However, since 'Classes of business' has many values, you can change the code to this to get all of them:

if li.strong.text.strip() == 'Classes of business':
    business_final = ', '.join([x.text.strip() for x in li.find_all('li')])
    print('Classes of business:', business_final)

Output:

Accepts Business From: U.S.A
Classes of business: Engineering, NM General Liability (US direct), Property D&F (US binder), Terrorism
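
As an alternative to catching the exception, one could avoid visiting the nested <li> elements at all by passing recursive=False to find_all, so that only the top-level <li> items (the ones that carry a <strong> label) are iterated. The following is only a sketch under assumptions: the HTML is a shortened copy of one record from the question (the disclaimer text is abbreviated), and extract_fields and fields are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of one record from the question; disclaimer text abbreviated.
html = """
<ul class="result-info info-cov-0 cc">
  <li><strong>Accepts Business From:</strong>
    <ul class="cc"><li>U.S.A</li></ul>
  </li>
  <li><strong>Classes of business</strong>
    <ul class="cc"><li>Engineering</li><li>Terrorism</li></ul>
  </li>
  <li><strong>Disclaimer:</strong><p>Please note ...</p></li>
</ul>
"""


def extract_fields(li_area):
    """Map each <strong> label to the list of values nested under it."""
    fields = {}
    # recursive=False limits the search to direct children of li_area,
    # so the inner <li>U.S.A</li> style items are never visited here.
    for li in li_area.find_all('li', recursive=False):
        strong = li.find('strong', recursive=False)
        if strong is None:
            continue  # no label on this item, nothing to record
        label = strong.get_text(strip=True).rstrip(':')
        values = [x.get_text(strip=True) for x in li.find_all('li')]
        if values:  # the disclaimer item has no nested <li>, so it is skipped
            fields[label] = values
    return fields


soup = BeautifulSoup(html, 'html.parser')
# Matching on the single class 'result-info' sidesteps the changing
# 'info-cov-N' class name that differs from record to record.
li_area = soup.find('ul', class_='result-info')
fields = extract_fields(li_area)
print(fields)
```

Because the labels are looked up explicitly and missing tags are checked with an is None test, no try-except is needed in this variant.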
