Question

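Edit: I'd also like help grabbing the "Classes of business" values listed for each company. Using the answer below with my code, strong.text.strip() == 'Classes of business' never seems to be True.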

I'm fairly experienced with BeautifulSoup and Python, but for whatever reason I can't seem to grab this data. Here is the block of HTML I'm working with:

<div class="marketing-directories-results">
    <ul>
        <li>
            <div class="contact-details">
                <h2>
                    A I I Insurance Brokerage of Massachusetts Inc
                </h2>
                <br/>
                <address>
                    183 Davis St
                    <br/>
                    East Douglas
                    <br/>
                    Massachusetts
                    <br/>
                    U S A
                    <br/>
                    MA 01516-113
                </address>
                <p>
                    <a href="http://www.agencyint.com">
                        www.agencyint.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-0">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-0 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Engineering
                        </li>
                        <li>
                            NM General Liability (US direct)
                        </li>
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
        <li>
            <div class="contact-details">
                <h2>
                    ABCO Insurance Underwriters Inc
                </h2>
                <br/>
                <address>
                    ABCO Building, 350 Sevilla Avenue, Suite 201
                    <br/>
                    Coral Gables
                    <br/>
                    Florida
                    <br/>
                    U S A
                    <br/>
                    33134
                </address>
                <p>
                    <a href="http://www.abcoins.com">
                        www.abcoins.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-1">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-1 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
    </ul>
</div>

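The page shows 10 records (I've only included the HTML for the first two, so that I can get help iterating over each company). Each record corresponds to one company and some further information about it, such as its address, its website URL, and things like "Accepts Business From: U.S.A".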

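I've been able to grab the name, address, and website URL, but I can't get the "U.S.A" under "Accepts Business From:" for each company (if it has one) and store it in the correct position in a list.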

I can get to the first "U.S.A" with:

other_info = comp_info_area.find_all('li')

other_info_next = other_info[0].find('ul')
other_info_next_next = other_info_next.find_all('li')
other_info_next_next_next = other_info_next_next[0].find('ul', class_='cc')
other_info_next_next_next_next = other_info_next_next_next.find('li')
print(other_info_next_next_next_next.text)

where comp_info_area is the BeautifulSoup object holding the HTML above. This returns: U.S.A

How can I grab the rest? I can't figure out how to navigate the tree to get there. Any help is much appreciated, thanks.

Edit: here is an example of a company that doesn't have that information:

<li>
    <div class="contact-details">
        <h2>
            Acadian Managers, LLC
        </h2>
        <br/>
        <address>
            8550 United Plaza Boulevard
            <br/>
            Suite 702
            <br/>
            Baton Rouge
            <br/>
            Louisiana
            <br/>
            U.S.A
            <br/>
            70809
        </address>
        <p>
            <a href="http://www.acadianmanagers.com">
                www.acadianmanagers.com
            </a>
        </p>
    </div>
</li>

Answer 1

You can create a dictionary for each company and then append it to a list.

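from bs4 import BeautifulSoup

# Assumption: `html` is a string holding the page source shown in the question.
soup = BeautifulSoup(html, 'html.parser')

# Get the <ul> tag which contains all the companies.
results = soup.find('div', class_='marketing-directories-results').ul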

companies_info = []

# Iterate over the companies (all <li> tags that are direct children of results,
# can be found by setting 'recursive=False').
for company in results.find_all('li', recursive=False):
    company_info = {}
    company_info['Name'] = company.find('h2').text.strip()
    company_info['Address'] = company.find('address').get_text(', ', strip=True)
    company_info['Website'] = company.find('a', href=True)['href']

    try:
        li = company.find('ul', class_='result-info').find('li')
        if li.strong.text.strip() == 'Accepts Business From:':
            company_info['Accepts Business From'] = li.find('li').text.strip()
    except AttributeError:
        # If this error is caught, it means this info is not available.
        # You can use the below line to set it to 'None', or simply use 'pass' to do nothing.
        company_info['Accepts Business From'] = None

    companies_info.append(company_info)

print(companies_info)

Output:

[
    {
        'Name': 'A I I Insurance Brokerage of Massachusetts Inc', 
        'Address': '183 Davis St, East Douglas, Massachusetts, U S A, MA 01516-113', 
        'Website': 'http://www.agencyint.com', 
        'Accepts Business From': 'U.S.A'
    }, 
    {
        'Name': 'ABCO Insurance Underwriters Inc', 
        'Address': 'ABCO Building, 350 Sevilla Avenue, Suite 201, Coral Gables, Florida, U S A, 33134', 
        'Website': 'http://www.abcoins.com', 
        'Accepts Business From': 'U.S.A'
    }, 
    {
        'Name': 'Acadian Managers, LLC', 
        'Address': '8550 United Plaza Boulevard, Suite 702, Baton Rouge, Louisiana, U.S.A, 70809', 
        'Website': 'http://www.acadianmanagers.com', 
        'Accepts Business From': None
    }
]
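
Regarding the follow-up in the question's edit: the comparison with 'Classes of business' is most likely never True because the code above only inspects the first <li> inside the result-info list, and that one is the "Accepts Business From:" entry. One way around this is to loop over every section-level <li> and key the values by the <strong> heading. The following is a minimal sketch, not a definitive implementation; SAMPLE_HTML is a trimmed-down stand-in for the question's markup, and the names WANTED and extract_sections are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample mirroring the structure in the question.
SAMPLE_HTML = """
<div class="marketing-directories-results">
  <ul>
    <li>
      <div class="contact-details"><h2> ABCO Insurance Underwriters Inc </h2></div>
      <ul class="result-info info-cov-1 cc">
        <li><strong> Accepts Business From: </strong>
          <ul class="cc"><li> U.S.A </li></ul>
        </li>
        <li><strong> Classes of business </strong>
          <ul class="cc"><li> Property D&amp;F (US binder) </li><li> Terrorism </li></ul>
        </li>
        <li><strong> Disclaimer: </strong><p> (text omitted) </p></li>
      </ul>
    </li>
    <li>
      <div class="contact-details"><h2> Acadian Managers, LLC </h2></div>
    </li>
  </ul>
</div>
"""

# Headings to collect, written without the trailing colon.
WANTED = ('Accepts Business From', 'Classes of business')


def extract_sections(company):
    """Return {heading: [values]} for the wanted headings of one company <li>."""
    sections = {heading: [] for heading in WANTED}
    info = company.find('ul', class_='result-info')
    if info is None:
        # This company has no "Additional trading information" block.
        return sections
    # recursive=False keeps only the section-level <li> tags, not the nested values.
    for section in info.find_all('li', recursive=False):
        strong = section.find('strong')
        values = section.find('ul')
        if strong is None or values is None:
            continue  # e.g. the Disclaimer section, which has <p> tags instead of a list
        heading = strong.get_text(strip=True).rstrip(':')
        if heading in sections:
            sections[heading] = [li.get_text(strip=True) for li in values.find_all('li')]
    return sections


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
results = soup.find('div', class_='marketing-directories-results').ul

companies = []
for company in results.find_all('li', recursive=False):
    record = {'Name': company.find('h2').get_text(strip=True)}
    record.update(extract_sections(company))
    companies.append(record)

print(companies)
```

Storing each field as a list (empty when the section is missing) keeps every record the same shape, so a company with several classes of business and one with none can be handled the same way downstream.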

Beautifulsoup4 - Get text strings of multiple non-unique tags and store in a list
