Question

I'm working with the following 'block' of HTML:

<div class="marketing-directories-results">
    <ul>
        <li>
            <div class="contact-details">
                <h2>
                    A I I Insurance Brokerage of Massachusetts Inc
                </h2>
                <br/>
                <address>
                    183 Davis St
                    <br/>
                    East Douglas
                    <br/>
                    Massachusetts
                    <br/>
                    U S A
                    <br/>
                    MA 01516-113
                </address>
                <p>
                    <a href="http://www.agencyint.com">
                        www.agencyint.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-0">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-0 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Engineering
                        </li>
                        <li>
                            NM General Liability (US direct)
                        </li>
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
        <li>
            <div class="contact-details">
                <h2>
                    ABCO Insurance Underwriters Inc
                </h2>
                <br/>
                <address>
                    ABCO Building, 350 Sevilla Avenue, Suite 201
                    <br/>
                    Coral Gables
                    <br/>
                    Florida
                    <br/>
                    U S A
                    <br/>
                    33134
                </address>
                <p>
                    <a href="http://www.abcoins.com">
                        www.abcoins.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-1">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-1 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
    </ul>
</div>

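I'm scraping several data points from this HTML. The ones giving me trouble are the "Accepts Business From:" and "Classes of business" values. I can get the "Accepts Business From:" value regardless of the order in which it appears: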

try:
    li_area = company.find('ul', class_='result-info info-cov-' + 
                                  str(company_counter) + ' cc')
    li_stuff = li_area.find_all('li')
    for li in li_stuff:
        if li.strong.text.strip() == 'Accepts Business From:':
            business_final = li.find('li').text.strip()
except AttributeError:
    pass

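Note: the "company" variable is the BeautifulSoup object containing the HTML I pasted above.

Note: the class name changes for each record on the page - I've only included a short excerpt in the HTML sample to keep it brief.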

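When I try the same code block, this time replacing 'Accepts Business From:' with 'Classes of business' in the li.strong.text.strip() == ... comparison, the code doesn't seem to detect the strong tag; it only works for 'Accepts Business From:'. Is my for loop incorrect, so that it isn't actually iterating over each of the <li> tags that contain these different strong tags? Could the real value of this strong tag be something other than 'Classes of business'? (I did copy the value directly from the site's HTML.)

Any help you can provide is greatly appreciated.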

Answer 1

The reason you get the text for 'Accepts Business From:' but not for 'Classes of business' is that your try-except is in the wrong place.
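To see why the placement matters, here is a minimal sketch in plain Python with no BeautifulSoup involved. The `items` list and the two function names are hypothetical stand-ins: `None` plays the role of an <li> that has no <strong> child.

```python
# Hypothetical stand-ins for the parsed <li> tags: None plays the role of
# an <li> with no <strong> child, where li.strong would be None.
items = ["Accepts Business From:", None, "Classes of business"]


def scan_try_outside(items):
    """try-except wraps the whole loop: the first error ends the loop."""
    seen = []
    try:
        for item in items:
            seen.append(item.strip())  # None.strip() raises AttributeError
    except AttributeError:
        pass  # by the time we get here, the loop has already been abandoned
    return seen


def scan_try_inside(items):
    """try-except sits inside the loop: only the failing item is skipped."""
    seen = []
    for item in items:
        try:
            seen.append(item.strip())
        except AttributeError:
            continue  # skip this item and move on to the next one
    return seen


print(scan_try_outside(items))  # ['Accepts Business From:']
print(scan_try_inside(items))   # ['Accepts Business From:', 'Classes of business']
```

The first function stops at the second item and never sees the third, which mirrors the behaviour described in the question.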

In the second iteration of the for li in li_stuff: loop, li is <li>U.S.A</li>, because find_all('li') also returns the nested <li> tags. That tag has no <strong> child, so li.strong is None and li.strong.text raises an AttributeError. With your current try-except, the error is caught outside the for loop and handled with pass. The loop therefore never reaches the third iteration, which is where it would have fetched the text for 'Classes of business'.

To continue the loop after catching the error, use:

for li in li_stuff:
    try:
        if li.strong.text.strip() == 'Accepts Business From:':
            business_final = li.find('li').text.strip()
            print('Accepts Business From:', business_final)
        if li.strong.text.strip() == 'Classes of business':
            business_final = li.find('li').text.strip()
            print('Classes of business:', business_final)
    except AttributeError:
        pass  # or you can use 'continue' too.

Output:

Accepts Business From: U.S.A
Classes of business: Engineering

However, since 'Classes of business' has several values, you can change the code to this to get all of them:

if li.strong.text.strip() == 'Classes of business':
    business_final = ', '.join([x.text.strip() for x in li.find_all('li')])
    print('Classes of business:', business_final)

Output:

Accepts Business From: U.S.A
Classes of business: Engineering, NM General Liability (US direct), Property D&F (US binder), Terrorism
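
An alternative that avoids the AttributeError altogether is to iterate only over the direct <li> children of the result list, using find_all with recursive=False, so the nested value items are never treated as section headers. The following is a minimal sketch under assumptions: the `html` string is a trimmed copy of the HTML from the question, and the `extract_sections` helper is a hypothetical name, not part of the original code.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of one record's result list (assumption: same structure
# as the HTML in the question, with fewer entries).
html = """
<ul class="result-info info-cov-0 cc">
  <li><strong>Accepts Business From:</strong>
    <ul class="cc"><li>U.S.A</li></ul>
  </li>
  <li><strong>Classes of business</strong>
    <ul class="cc"><li>Engineering</li><li>Terrorism</li></ul>
  </li>
</ul>
"""


def extract_sections(result_list):
    """Map each <strong> heading to the list of values nested under it."""
    sections = {}
    # recursive=False returns only the direct <li> children, so the nested
    # value items (which have no <strong>) are never visited here.
    for li in result_list.find_all('li', recursive=False):
        heading = li.find('strong')
        if heading is None:
            continue  # defensive: skip any item without a heading
        label = heading.get_text(strip=True).rstrip(':')
        sections[label] = [x.get_text(strip=True) for x in li.find_all('li')]
    return sections


soup = BeautifulSoup(html, 'html.parser')
sections = extract_sections(soup.find('ul', class_='result-info'))
print(sections)
```

Matching on the single class 'result-info' also removes the need to build the per-record class string with a counter, provided the search starts from one record's own tag.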


Beautifulsoup4 - Identifying information by strong tag value only works for some values of the tag
