Scrapy is not crawling or scraping sites like seatgeek / vividseats

Time: 2018-02-27 15:17:02

Tags: python web-scraping scrapy

I am trying to scrape ticket information from seatgeek, but I am struggling to get it working. When I run my code, I get this:

INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

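My idea is that I enter the name of a show/event, Scrapy collects the URL of each performance of that show, and then it scrapes the ticket prices and so on. My code is below: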

import scrapy
from seatgeek import items

class seatgeekSpider(scrapy.Spider):
    name = "seatgeek_spider"
    showname = input("Enter Show name (lower case please): ")
    showname = showname.replace(' ', '-')
    start_urls = "https://seatgeek.com/" + showname + "-tickets.html"

    def parse_performance(self, response):
        for href in response.xpath('//a[@class="event-listing-title"]/@href').extract():
            yield scrapy.Request(
                url= 'https://seatgeek.com/' + href,
                callback=self.parse_ticketinv,
                method="POST",
                meta={'url': href})

    def parse_ticketinv(self, response):

        price = response.xpath('//span[@class="omnibox__listing__buy__price"]').extract()
        performance = response.xpath('//div[@class="event-detail-words faint-words"]/text()').extract()
        quantity = response.xpath('//div[@class="omnibox__seatview__availability"]/text()').extract()
        seatinfo = response.xpath('//div[@class="omnibox__listing__section"]/text()').extract()

        # creating scrapy items
        item = items.SeatgeekItem()
        item['price'] = price
        item['performance'] = performance
        item['quantity'] = quantity
        item['seatinfo'] = seatinfo

        yield item

Here is my items.py code:

import scrapy

class SeatgeekItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    price = scrapy.Field()
    performance = scrapy.Field()
    quantity = scrapy.Field()
    seatinfo = scrapy.Field()

Any help is greatly appreciated. Thank you!

1 Answer:

Answer 0: (score: 1)

I can see two immediate problems:

  • start_urls should be a list; you should be seeing an error like this:

    Traceback (most recent call last):
    (...)
        raise ValueError('Missing scheme in request url: %s' % self._url)
    ValueError: Missing scheme in request url: h
    
  • By default, the callback used for the URLs in start_urls is parse(), which is not defined in your code. Perhaps you should rename your parse_performance() method?
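
The lone "h" at the end of that error message follows from how Python iterates: Scrapy loops over start_urls, and looping over a string yields one character at a time. Here is a minimal plain-Python sketch of that behaviour; the show name "hamilton" is only a placeholder for illustration, not something taken from the question.

```python
# Placeholder URL for illustration; "hamilton" is an assumed show name.
url = "https://seatgeek.com/hamilton-tickets.html"

# start_urls as a bare string: iterating it yields single characters.
start_urls_as_string = url
first_from_string = next(iter(start_urls_as_string))

# start_urls as a list: iterating it yields the whole URL.
start_urls_as_list = [url]
first_from_list = next(iter(start_urls_as_list))

print(repr(first_from_string))  # 'h'
print(repr(first_from_list))    # 'https://seatgeek.com/hamilton-tickets.html'
```

So with a string, the first "URL" Scrapy tries to request is just "h", which has no scheme.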

Also, spider arguments are the more common way to get user input.