scrapy: How do I scrape <ul> and <li> together, one at a time?

Time: 2019-12-19 15:05:18

Tags: python xpath scrapy response

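I am using Python Scrapy to fetch content from a web page, and I want to get the content in the following order:

Living room, link to Chair

Living room, link to Sofa

...

Bed room, link to Bed

Bed room, link to Mirror

...

Right now the URLs are correct, but every sub_cat printed in parse_item_info is Living room. When I print sub_cat in parse_item, I do get all the sub-categories.

I think the problem is that the <li> tags inside the <ul> elements are fetched twice. How can I pair them up correctly, one to one? Thanks.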

html:

  <div class="row margin-b2">
                    <div class="col">
                        <ul class="list-unstyled">
                                <li class="font-size-3 color-yellow main-list-li">
                                    Living room
                                </li>
                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/536" class="text-light">Chair</a>
                                    </li>
                                    ...
                                </ul>   
                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/537" class="text-light">Sofa</a>
                                    </li>
                                    ...
                                </ul>


                                <li class="font-size-3 color-yellow main-list-li">
                                    Bed room
                                </li>

                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/538" class="text-light">Bed</a>
                                    </li>
                                    ...
                                </ul>   

                                <ul class="list-inline main-list-ul">   
                                    <li class="list-inline-item main-list-li-w align-text-top">
                                            <a href="https://www.website.com/furiture/539" class="text-light">Mirror</a>
                                    </li>
                                    ...
                                </ul>       

                                ...

                        </ul>
                    </div>
                </div>

Python:

    def parse_item(self, response):
        cat = response.meta["cat"]
        out_box = response.xpath('//div[@class="row margin-b2"]')

        # get all sub categories first
        sub_cat_arr = []
        for box in out_box.xpath('//li[@class="font-size-3 color-yellow main-list-li"]'):
            sub_cat = box.xpath('./text()').extract()[0].strip()
            sub_cat_arr.append(sub_cat)

        i = 0
        for box in out_box.xpath('//ul[@class="list-inline main-list-ul"]'):
            sub_cat = sub_cat_arr[i]
            i += 1
            print("in......")
            print(sub_cat)
            for url_box in box.xpath('//li[@class="list-inline-item main-list-li-w align-text-top"]//a'):
                new_url = url_box.xpath('.//@href').extract()[0]
                yield scrapy.Request(new_url, meta={"url": new_url, "cat": cat, "sub_cat": sub_cat}, callback=self.parse_item_info)


    def parse_item_info(self, response):
        cat = response.meta["cat"]
        sub_cat = response.meta["sub_cat"]
        url = response.meta["url"]
        print(sub_cat)
        print(url)
        ...

1 Answer:

Answer 0 (score: 0)

To avoid processing the same tag twice, you should use a single loop:

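    def parse_item(self, response):
        cat = response.meta["cat"]
        # No sub-category heading has been seen yet.
        current_sub_cat = None
        # One selector matches both the heading <li> and the item links; results come back in document order.
        for tag in response.css("ul.list-unstyled li.font-size-3.color-yellow.main-list-li, ul.list-inline.main-list-ul li a"):
            if tag.root.tag == "li":
                # A heading: remember it for the links that follow.
                current_sub_cat = tag.css("*::text").extract_first("").strip()
            elif tag.root.tag == "a":
                new_url = tag.css("*::attr(href)").extract_first()
                # Use the "url" key, because parse_item_info reads response.meta["url"].
                yield scrapy.Request(url=new_url, meta={"url": new_url, "sub_cat": current_sub_cat, "cat": cat}, callback=self.parse_item_info)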