Scrapy spider scrapes the same data multiple times

Time: 2020-09-21 10:24:16

Tags: python web-scraping scrapy

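I'm trying to scrape some reviews from the web. I iterate over a container that holds all of my items, to reduce the number of requests and make the spider faster. The CSV produced after the whole crawl shows that the spider sometimes collects the same data repeatedly, for 20 or more rows. When I don't iterate over the container, the spider is very slow and stops scraping after a few pages, but it returns the data correctly.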

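I also generate the next pages by changing a value in the URL, as shown in the code below. I did this because the next-page icon in the HTML has no href value.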

I don't know what's wrong. I need help!

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from yelp.items import YelpItem
import urllib.parse
from scrapy.http.request import Request
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class RestSpider(scrapy.Spider):
    name = 'rest'
    start_urls = ['https://www.yelp.com/biz/burma-superstar-san-francisco-2?osq=Restaurants/']
    

    def parse(self, response):
        pages = [str(i) for i in range(0,6800,20)]
        for page in pages:
            url = 'https://www.yelp.com/biz/burma-superstar-san-francisco-2?osq=Restaurants&start='+page
            yield Request(urllib.parse.urljoin(response.url, url))

        selectors = response.xpath("//div[contains(@class,'lsidebarActionsHoverTarget')]")
        for selector in selectors:
            yield self.parse_item(selector, response)


    def parse_item(self,selector,response):
        l=ItemLoader(item=YelpItem(),selector = response)
        l.default_output_processor = TakeFirst()
        l.add_xpath('date', './/span[@class="lemon--span__373c0__3997G text__373c0__2Kxyz text-color--mid__373c0__jCeOG text-align--left__373c0__2XGa-"]/text()')
        l.add_xpath('location', './/span[@class="lemon--span__373c0__3997G text__373c0__2Kxyz text-color--normal__373c0__3xep9 text-align--left__373c0__2XGa- text-weight--bold__373c0__1elNz text-size--small__373c0__3NVWO"]/text()')
        l.add_xpath('review', './/p[contains(@class, "comment")]/span/text()')
        l.add_xpath('rating', './/div[contains(@class,"i-stars")]/@aria-label')
        return l.load_item()
    

1 answer:

Answer 0: (score: 1)

The problem is here:

def parse_item(self,selector,response):
    l=ItemLoader(item=YelpItem(),selector = response)

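Your code creates multiple selectors, but you never use them. Instead, you always pass response to the item loader. Every relative XPath is therefore evaluated against the whole page, and TakeFirst keeps only the first match, so each container on a page produces the same row. Pass selector to the item loader instead of response.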