Scrapy报告0页被抓取

时间:2017-08-02 22:15:00

标签: python web-scraping scrapy

我试图在代码中搜索网站上的鞋子价格。我不知道我的语法是否正确。我真的可以使用一些帮助。

from scrapy.spider import BaseSpider
from scrapy import Field
from scrapy import Item
from scrapy.selector import HtmlXPathSelector

def Yeezy(Item):
 price = Field()


class YeezySpider(BaseSpider):
  name = "yeezy"
  allowed_domains = ["https://www.grailed.com/"]
  start_url = ['https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2']

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    price = hxs.css('.listing-price .sub-title:nth-child(1) span').extract()
    items = []
    for price in price:
        item = Yeezy()
        item["price"] = price.select(".listing-price .sub-title:nth-child(1) span").extract()
        items.append(item)
    yield item

代码将此报告给控制台:

ScrapyDeprecationWarning: YeezyScrape.spiders.yeezy_spider.YeezySpider     inherits from deprecated class scrapy.spider.BaseSpider, please inherit from      scrapy.spider.Spider. (warning only on first subclass, there may be others)
  class YeezySpider(BaseSpider):
2017-08-02 14:45:25-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-02 14:45:25-0700 [scrapy] INFO: Optional features available: ssl,     http11
2017-08-02 14:45:25-0700 [scrapy] INFO: Overridden settings:     {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'SPIDER_MODULES':     ['YeezyScrape.spiders'], 'BOT_NAME': 'YeezyScrape'}   
2017-08-02 14:45:25-0700 [scrapy] INFO: Enabled extensions: LogStats,     TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled item pipelines: 
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider opened
2017-08-02 14:45:26-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Telnet console listening on     127.0.0.1:6023
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-02 14:45:26-0700 [yeezy] INFO: Closing spider (finished)
2017-08-02 14:45:26-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 127000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 125000)}
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider closed (finished)

Process finished with exit code 0

起初我认为这是我输入的css元素的问题,但现在我不太确定。这是我第一次尝试这样的项目,我真的可以使用一些洞察力。先感谢您。

编辑:所以我尝试通过另一个例子在我的代码中模拟xhr请求。这就是我所拥有的:

import scrapy
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
#from YeezyScrape import YeezyscrapeItem


class YeezySpider(scrapy.Spider):
    name = "yeezy"
    allowed_domains = ["www.grailed.com"]
    start_url = ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"]

    def parse(self, response):
        for i in range(0,2):
            yield FormRequest(url = 'https://mnrwefss2q-
dsn.algolia.net/1/indexes/Listing_production/query?x-algolia-
agent=Algolia%20for%20vanilla%20JavaScript%203.21.1&x-algolia-application-
id=MNRWEFSS2Q&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad', 
method="post", formdata={"params":"query:boost
filters:(strata:'basic' OR strata:'grailed' OR strata:'hype') AND 
(category_path:'footwear.slip_ons' OR category_path:'footwear.sandals' OR 
category_path:'footwear.lowtop_sneakers' OR category_path:'footwear.leather' 
OR category_path:'footwear.hitop_sneakers' OR 
category_path:'footwear.formal_shoes' OR category_path:'footwear.boots') AND 
(marketplace:grailed)
hitsPerPage:40
facets ["strata","size","category","category_size",
 "category_path","category_path_size",
"category_path_root_size","price_i","designers.id",
"location","marketplace"] 
page:2"}, callback=self.data_parse())

def data_parse(self, response):
    hxs = HtmlXPathSelector(response)
    prices = hxs.xpath("//p").extract()
    for prices in prices:
        price = prices.select("a/text()").extract()
        print price

我不得不重新格式化一些东西以适应Python和Stackoverflow之间的缩进差异。

这些是终端中报告的日志,再次感谢您的帮助:

C:\Python27\python.exe C:/Python27/Lib/site-packages/scrapy/cmdline.py crawl yeezy -o price.json
2017-08-04 13:23:27-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-04 13:23:27-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-04 13:23:27-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'FEED_URI': 'price.json', 'BOT_NAME': 'YeezyScrape'}
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled item pipelines: 
2017-08-04 13:23:27-0700 [yeezy] INFO: Spider opened
2017-08-04 13:23:28-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-04 13:23:28-0700 [yeezy] INFO: Closing spider (finished)
2017-08-04 13:23:28-0700 [yeezy] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 3000),
     'log_count/DEBUG': 2,
     'log_count/INFO': 7,
     'start_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 1000)}
2017-08-04 13:23:28-0700 [yeezy] INFO: Spider closed (finished)

Process finished with exit code 0

1 个答案:

答案 0 :(得分:0)

似乎产品是由AJAX检索的(参见相关:Can scrapy be used to scrape dynamic content from websites that are using AJAX?) 如果您打开浏览器webinspector,选择网络选项卡并在页面加载时查找XHR请求,您可以看到:

firebug network

似乎正在使用类别,过滤器等进行json类型请求,并返回<div class="parent"> <ul> {{#each model as | temp index|}} <li class={{if (eq selectedIndex index) 'highlight' }} {{action 'changeSelectedIndex' index}}>{{temp}}</li> {{/each}} </ul> </div> 个产品。你可以对它进行逆向工程并在scrapy中复制它。