Webcrawler: reading HTML tags that have specific parents

Date: 2018-07-27 11:03:37

Tags: python-3.x beautifulsoup web-crawler

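I want to crawl all the IDs so that I can insert them into my_url one after another. Getting the data-id of each li works fine, but I also want to get the data-go-to-expose-id of each a tag. I have tried several approaches, but none of them worked.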

HTML:

<ul id="resultListItems" class="is24-res-list is24-res-gallery result-list border-top">
    <li class="result-list__listing result-list__listing--xl" data-id="102292896">
        <div>
            <article data-item="result" id="result-102292896" data-obid="102292896" class="result-list-entry result-list-entry--xl result-list-entry--project result-list-entry--with-logo" data-listing-size="XL">
                <div class="result-list-entry__grouped-listings">
                    <div class="slick-initialized slick-slider">
                        <div aria-live="polite" class="slick-list draggable">
                            <div class="slick-track" style="opacity: 1; width: 356px; transform: translate3d(0px, 0px, 0px);">
                                <div class="grouped-listing slick-slide slick-current slick-active grouped-listing--active" style="width: 162px;" data-slick-index="0" aria-hidden="false">
                                    <a href="/expose/102292896" id="result-102292896" data-go-to-expose-id="102292896" data-go-to-expose-referrer="RESULT_LIST_GROUPED">

Script:

(...)
try:
    get_id = soup(url, "html.parser")

    for biglist in get_id.find_all("li", {"data-id": True}):
        if (biglist.parent.get("id") == "resultListItems"):
            my_url = "https://www.abc.de/"+biglist.get("data-id")+"#/"
            (...)

This part works fine, but the next part does not.

    for list1 in get_id.find_all("a", {"data-go-to-expose-id": True}):
        if (list1.parent("div", "class") == "grouped-listing"):
            my_url2 = "https://www.abc.de/"+list1.get("data-go-to-expose-id")+"#/"
            (...)

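How can I make it search for the "li" IDs first and then for the "a" IDs? The second part finds no results. Maybe it is because the class of the parent div is more than just "grouped-listing"?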

1 answer:

Answer 0 (score: 0):

To find all <a> tags that have the attribute "data-go-to-expose-id" and sit inside a <div> with the class "grouped-listing", you can use the CSS selector 'div.grouped-listing a[data-go-to-expose-id]', like this:

data = """
<ul id="resultListItems" class="is24-res-list is24-res-gallery result-list border-top">
    <li class="result-list__listing result-list__listing--xl" data-id="102292896">
        <div>
            <article data-item="result" id="result-102292896" data-obid="102292896" class="result-list-entry result-list-entry--xl result-list-entry--project result-list-entry--with-logo" data-listing-size="XL">
                <div class="result-list-entry__grouped-listings">
                    <div class="slick-initialized slick-slider">
                        <div aria-live="polite" class="slick-list draggable">
                            <div class="slick-track" style="opacity: 1; width: 356px; transform: translate3d(0px, 0px, 0px);">
                                <div class="grouped-listing slick-slide slick-current slick-active grouped-listing--active" style="width: 162px;" data-slick-index="0" aria-hidden="false">
                                    <a href="/expose/102292896" id="result-102292896" data-go-to-expose-id="102292896" data-go-to-expose-referrer="RESULT_LIST_GROUPED">
"""


from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for a in soup.select('div.grouped-listing a[data-go-to-expose-id]'):
    my_url2 = "https://www.abc.de/"+a['data-go-to-expose-id']+"#/"
    print(my_url2)

This prints:

https://www.abc.de/102292896#/
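As for why the loop in the question finds nothing: calling a tag, as in list1.parent("div", "class"), is shorthand for find_all(), so it returns a list of matches and never equals a string. Beyond that, BeautifulSoup returns the class attribute as a list of class names, so a direct comparison with "grouped-listing" fails whenever the div carries more than one class, as it does here. The following is a minimal sketch, not a definitive implementation, that collects the "li" IDs first and the "a" IDs second. It assumes a trimmed-down copy of the markup from the question (with closing tags added), uses the built-in "html.parser" so that lxml is not required, and introduces a hypothetical helper build_url for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (closing tags added).
html = """
<ul id="resultListItems">
  <li class="result-list__listing" data-id="102292896">
    <div class="grouped-listing slick-slide slick-current">
      <a href="/expose/102292896" data-go-to-expose-id="102292896"></a>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def build_url(item_id):
    # Hypothetical helper; the URL pattern is the placeholder from the question.
    return "https://www.abc.de/" + item_id + "#/"


# 1) <li data-id> elements that are direct children of #resultListItems.
li_urls = [build_url(li["data-id"])
           for li in soup.select("#resultListItems > li[data-id]")]

# 2) <a data-go-to-expose-id> elements whose direct parent is a <div>
#    that has "grouped-listing" among its classes.
a_urls = []
for a in soup.find_all("a", {"data-go-to-expose-id": True}):
    parent = a.parent
    # get("class") returns a list of class names, so test membership
    # instead of comparing the whole value with a string.
    if parent.name == "div" and "grouped-listing" in parent.get("class", []):
        a_urls.append(build_url(a["data-go-to-expose-id"]))

print(li_urls)
print(a_urls)
```

Note that the space in 'div.grouped-listing a[...]' matches any descendant of the div. If only direct children should match, 'div.grouped-listing > a[data-go-to-expose-id]' is the stricter form and corresponds to the parent check in the loop above.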