BeautifulSoup4 - Get the text strings of multiple non-unique tags and store them in a list

Date: 2018-02-22 22:52:49

Tags: python html python-3.x web-scraping beautifulsoup

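Edit: I would also like help grabbing the "Classes of business" values listed for each company. Using the answer below as the basis for my code, strong.text.strip() == 'Classes of business' never seems to be True.

I'm fairly experienced with BeautifulSoup and Python, but for whatever reason I can't seem to grab this data. Below is the block of HTML I'm working with: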

<div class="marketing-directories-results">
    <ul>
        <li>
            <div class="contact-details">
                <h2>
                    A I I Insurance Brokerage of Massachusetts Inc
                </h2>
                <br/>
                <address>
                    183 Davis St
                    <br/>
                    East Douglas
                    <br/>
                    Massachusetts
                    <br/>
                    U S A
                    <br/>
                    MA 01516-113
                </address>
                <p>
                    <a href="http://www.agencyint.com">
                        www.agencyint.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-0">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-0 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Engineering
                        </li>
                        <li>
                            NM General Liability (US direct)
                        </li>
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
        <li>
            <div class="contact-details">
                <h2>
                    ABCO Insurance Underwriters Inc
                </h2>
                <br/>
                <address>
                    ABCO Building, 350 Sevilla Avenue, Suite 201
                    <br/>
                    Coral Gables
                    <br/>
                    Florida
                    <br/>
                    U S A
                    <br/>
                    33134
                </address>
                <p>
                    <a href="http://www.abcoins.com">
                        www.abcoins.com
                    </a>
                </p>
            </div>
            <span data-toggle=".info-cov-1">
                Additional trading information
                <i class="icon plus">
                </i>
            </span>
            <ul class="result-info info-cov-1 cc">
                <li>
                    <strong>
                        Accepts Business From:
                    </strong>
                    <ul class="cc">
                        <li>
                            U.S.A
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Classes of business
                    </strong>
                    <ul class="cc">
                        <li>
                            Property D&amp;F (US binder)
                        </li>
                        <li>
                            Terrorism
                        </li>
                    </ul>
                </li>
                <li>
                    <strong>
                        Disclaimer:
                    </strong>
                    <p>
                        Please note that while coverholders may have been approved by Lloyd's to accept business from the regions shown:
                    </p>
                    <p>
                        it is the responsibility of the parties, including the coverholder and any Lloyd's managing agent appointing them to ensure that the coverholder complies with all local regulatory and legal requirements; and
                    </p>
                    <p>
                        the coverholder may not provide cover for all classes they are approved to underwrite in all territories where they have approval.
                    </p>
                </li>
            </ul>
        </li>
    </ul>
</div>

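The page displays 10 records (I've only included the HTML for the first two, so that I can get help iterating over each company). Each record corresponds to one company and carries further information about it, such as its address, its website URL, and entries like "Accepts Business From: U.S.A".

I've been able to get the name, address and website URL, but I can't get the "U.S.A" under "Accepts Business From:" for each company (where it exists) and store it at the correct position in a list.

I can reach the first U.S.A with: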

other_info = comp_info_area.find_all('li')

other_info_next = other_info[0].find('ul')
other_info_next_next = other_info_next.find_all('li')
other_info_next_next_next = other_info_next_next[0].find('ul', class_='cc')
other_info_next_next_next_next = other_info_next_next_next.find('li')
print(other_info_next_next_next_next.text)

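where comp_info_area is the BeautifulSoup object holding the HTML above. This returns: U.S.A

How can I grab the rest? I can't figure out how to navigate the tree to get there. Any help is much appreciated, thanks.

Edit: here is an example of a company that does not have this information: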

<li>
    <div class="contact-details">
        <h2>
            Acadian Managers, LLC
        </h2>
        <br/>
        <address>
            8550 United Plaza Boulevard
            <br/>
            Suite 702
            <br/>
            Baton Rouge
            <br/>
            Louisiana
            <br/>
            U.S.A
            <br/>
            70809
        </address>
        <p>
            <a href="http://www.acadianmanagers.com">
                www.acadianmanagers.com
            </a>
        </p>
    </div>
</li>

1 Answer:

Answer 0: (score: 0)

You can create a dictionary for each company and then append it to a list.

# 'soup' is assumed to be the BeautifulSoup object built from the page's HTML.
# Get the <ul> tag which contains all the companies.
results = soup.find('div', class_='marketing-directories-results').ul

companies_info = []

# Iterate over the companies (all <li> tags that are direct children of results,
# can be found by setting 'recursive=False').
for company in results.find_all('li', recursive=False):
    company_info = {}
    company_info['Name'] = company.find('h2').text.strip()
    company_info['Address'] = company.find('address').get_text(', ', strip=True)
    company_info['Website'] = company.find('a', href=True)['href']

    # Default to None so the key is always present, even when the info is missing.
    company_info['Accepts Business From'] = None
    try:
        li = company.find('ul', class_='result-info').find('li')
        if li.strong.text.strip() == 'Accepts Business From:':
            company_info['Accepts Business From'] = li.find('li').text.strip()
    except AttributeError:
        # The company has no 'result-info' list, so the default None is kept.
        pass

    companies_info.append(company_info)

print(companies_info)

Output:

[
    {
        'Name': 'A I I Insurance Brokerage of Massachusetts Inc', 
        'Address': '183 Davis St, East Douglas, Massachusetts, U S A, MA 01516-113', 
        'Website': 'http://www.agencyint.com', 
        'Accepts Business From': 'U.S.A'
    }, 
    {
        'Name': 'ABCO Insurance Underwriters Inc', 
        'Address': 'ABCO Building, 350 Sevilla Avenue, Suite 201, Coral Gables, Florida, U S A, 33134', 
        'Website': 'http://www.abcoins.com', 
        'Accepts Business From': 'U.S.A'
    }, 
    {
        'Name': 'Acadian Managers, LLC', 
        'Address': '8550 United Plaza Boulevard, Suite 702, Baton Rouge, Louisiana, U.S.A, 70809', 
        'Website': 'http://www.acadianmanagers.com', 
        'Accepts Business From': None
    }
]
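
Regarding the edit in the question about "Classes of business": the code above only looks at the first <li> under the 'result-info' list, which is the "Accepts Business From:" entry, so a comparison against 'Classes of business' at that point cannot be True. One possible approach is to loop over every top-level <li> of that list and key the values by the text of its <strong> label. The following is only a minimal sketch under some assumptions: extract_sections is a hypothetical helper name, the inline HTML is a trimmed sample that mirrors the structure in the question rather than the real page, and values are stored as lists because a section can hold several entries.

```python
from bs4 import BeautifulSoup

# Trimmed sample (an assumption) mirroring the structure of the HTML in the question.
html = """
<div class="marketing-directories-results">
  <ul>
    <li>
      <div class="contact-details"><h2>ABCO Insurance Underwriters Inc</h2></div>
      <ul class="result-info info-cov-1 cc">
        <li><strong>Accepts Business From:</strong>
          <ul class="cc"><li>U.S.A</li></ul>
        </li>
        <li><strong>Classes of business</strong>
          <ul class="cc">
            <li>Property D&amp;F (US binder)</li>
            <li>Terrorism</li>
          </ul>
        </li>
        <li><strong>Disclaimer:</strong><p>Please note ...</p></li>
      </ul>
    </li>
    <li>
      <div class="contact-details"><h2>Acadian Managers, LLC</h2></div>
    </li>
  </ul>
</div>
"""


def extract_sections(company):
    """Hypothetical helper: map each <strong> label to its list of values."""
    sections = {}
    info = company.find('ul', class_='result-info')
    if info is None:
        # This company has no additional trading information.
        return sections
    # recursive=False restricts the search to the top-level <li> items.
    for item in info.find_all('li', recursive=False):
        label = item.find('strong')
        values = item.find('ul')
        if label is None or values is None:
            # Entries without a nested <ul> (e.g. the disclaimer) are skipped.
            continue
        key = label.get_text(strip=True).rstrip(':')
        sections[key] = [li.get_text(strip=True) for li in values.find_all('li')]
    return sections


soup = BeautifulSoup(html, 'html.parser')
results = soup.find('div', class_='marketing-directories-results').ul

companies_info = []
for company in results.find_all('li', recursive=False):
    sections = extract_sections(company)
    companies_info.append({
        'Name': company.find('h2').get_text(strip=True),
        'Accepts Business From': sections.get('Accepts Business From', []),
        'Classes of business': sections.get('Classes of business', []),
    })

print(companies_info)
```

With this layout, a company that lacks a section simply gets an empty list for it, so every dictionary in companies_info has the same keys and the entries stay aligned with the companies on the page.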