How to dynamically add URLs to start_urls

Time: 2018-07-17 22:47:11

Tags: python web-scraping scrapy

I am trying to scrape product information from Amazon and have run into a problem. When the crawler reaches the end of a page it stops, and I would like to add a way for my program to also follow the next 3 pages of results in a general way. I am trying to edit start_urls, but I can't do that from the parse function. Also, it's a minor thing, but for some reason the program asks for the same input twice. Thanks in advance.

import scrapy
from scrapy import Spider
from scrapy import Request

class ProductSpider(scrapy.Spider):
    product = input("What product are you looking for? Keywords help for specific products: ")
    name = "Product_spider"
    allowed_domains = ['www.amazon.ca']
    start_urls = ['https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords='+product]
    #so that websites will not block access to the spider
    download_delay = 30
    def parse(self, response):
        temp_url_list = []
        for i in range(3,6):
            next_url = response.xpath('//*[@id="pagn"]/span['+str(i)+']/a/@href').extract()
            next_url_final = response.urljoin(str(next_url[0]))
            start_urls.append(str(next_url_final))
        # xpath is similar to an address that is used to find certain elements in HTML code,this info is then extracted
        product_title = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@title').extract()
        product_price = response.xpath('//span[contains(@class,"s-price")]/text()').extract()
        product_url = response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@href').extract()
        # yield processes everything once and remembers where it left off; it does not store
        # the data itself but sends it on to the pipeline to be processed if need be
        yield {'product_title': product_title, 'product_price': product_price, 'url': product_url}
        # repeat the same process on the following pages
        # TODO: it is still checking the same url, no generality yet; maybe just do ~5 pages,
        # and see if results can be sorted high to low and matched on a certain number of keywords

2 Answers:

Answer 0 (score: 1)

You have misunderstood how Scrapy works here.

Scrapy expects your spider to yield scrapy.Request objects or scrapy.Item / dictionary objects. When the spider starts up, it takes the URLs from start_urls and yields a scrapy.Request for each of them:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url)

So once the spider has started, changing start_urls has no effect.

What you can do instead, however, is simply yield more scrapy.Request objects from your parse() method:

def parse(self, response):
    urls = response.xpath('//a/@href').extract()
    for url in urls:
        yield scrapy.Request(url, self.parse2)

def parse2(self, response):
    # the responses for the newly yielded requests arrive here
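
Applied to the spider from the question, a minimal sketch might look like the following. The pagination XPath, the span[3..5] range, and the field XPaths are copied from the question and are assumptions about Amazon's markup at the time (which changes often); the fixed keyword in start_urls is just a placeholder for the input() value.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "Product_spider"
    allowed_domains = ['www.amazon.ca']
    # placeholder keyword instead of input(), purely for illustration
    start_urls = ['https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=laptop']

    def parse(self, response):
        # extract the fields of the current results page, exactly as in the question
        yield {
            'product_title': response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@title').extract(),
            'product_price': response.xpath('//span[contains(@class,"s-price")]/text()').extract(),
            'url': response.xpath('//*/div/div/div/div[2]/div[1]/div[1]/a/@href').extract(),
        }
        # instead of appending to start_urls, yield a new Request for each pagination link;
        # Scrapy calls parse() again for every response that comes back
        for i in range(3, 6):
            next_url = response.xpath('//*[@id="pagn"]/span[%d]/a/@href' % i).extract_first()
            if next_url:
                yield scrapy.Request(response.urljoin(next_url), callback=self.parse)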

Answer 1 (score: 0)

You can override the __init__ method and simply pass the URLs in with the -a option. See Spider arguments in the Scrapy documentation.

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, urls='', *args, **kwargs):
        # the -a urls=... value arrives here as a comma-separated string
        self.start_urls = urls.split(',') if urls else []
        super(QuotesSpider, self).__init__(*args, **kwargs)

Run it like this:

scrapy crawl quotes -a "urls=http://quotes.toscrape.com/page/1/,http://quotes.toscrape.com/page/2/"
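
The same keyword argument also works when the crawl is started from a script rather than the command line, since CrawlerProcess.crawl() forwards extra keyword arguments to the spider's __init__. A minimal sketch, reusing the example URLs from above:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(
    QuotesSpider,
    urls="http://quotes.toscrape.com/page/1/,http://quotes.toscrape.com/page/2/",
)
process.start()  # the script blocks here until the crawl is finished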