Question

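I'm very new to Scrapy and have built a few spiders. I'm trying to scrape reviews from this page. So far my spider crawls the first page and scrapes the items, but when it gets to the pagination it does not follow the links.

I know this happens because the pagination is an Ajax request, and that it is a POST rather than a GET. I'm a newbie at this, but I read this. I have also read this post here and followed the "mini-tutorial" to get the URL from the response, which seems to be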

http://www.pcguia.pt/category/reviews/sorter=recent&location=&loop=main+loop&action=sort&view=grid&columns=3&paginated=2&currentquery%5Bcategory_name%5D=reviews

But when I try to open it in the browser it says

"Página não encontrada" = "PAGE NOT FOUND"

Am I thinking about this the right way so far? What am I missing?

Edit: my spider:

import scrapy
import json
from scrapy.http import FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem

class PcguiaSpider(scrapy.Spider):
    name = "pcguia" #spider name to call in terminal
    allowed_domains = ['pcguia.pt'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1'] #url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):

        sel = Selector(response)

        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))


        hxs = Selector(response)

        item_pub = ReviewItem()

        item_pub['date']= hxs.xpath('//span[@class="date"]/text()').extract() # is in the format year-month-dayThours:minutes:seconds-timezone ex: 2015-03-31T09:40:00-0700


        item_pub['title'] = hxs.xpath('//title/text()').extract()

        #pagination code starts here 
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr +=1
            formdata = {
                        'sorter':'recent',
                        'location':'main loop',
                        'loop':'main loop',
                        'action':'sort',
                        'view':'grid',
                        'columns':'3',
                        'paginated':str(self.page_incr),
                        'currentquery[category_name]':'reviews'
                        }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

        yield item_pub

Output:

2015-05-12 14:53:45+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: pcguia)
2015-05-12 14:53:45+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:53:45+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pcguia.spiders', 'SPIDER_MODULES': ['pcguia.spiders'], 'BOT_NAME': 'pcguia'}
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled item pipelines: 
2015-05-12 14:53:45+0100 [pcguia] INFO: Spider opened
2015-05-12 14:53:45+0100 [pcguia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6033
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6090
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Crawled (200) <GET http://www.pcguia.pt/category/reviews/#paginated=1> (referer: None)
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/category/reviews/>
    {'date': '',
     'title': [u'Reviews | PCGuia'],
}
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: http://www.pcguia.pt/category/reviews/)
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
    {'date': ''
     'title': ''
}

Answer 1

You can try this:

import json

from scrapy import Spider
from scrapy.http import FormRequest
from scrapy.selector import Selector
# other imports

class SpiderClass(Spider):
    # spider name and all
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):

        sel = Selector(response)

        # from the second page on, the response is JSON with the HTML under 'content'
        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here 

        #pagination code starts here 
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr +=1
            formdata = {
                    'sorter':'recent',
                    'location':'main loop',
                    'loop':'main loop',
                    'action':'sort',
                    'view':'grid',
                    'columns':'3',
                    'paginated':str(self.page_incr),
                    'currentquery[category_name]':'reviews'
                    }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return
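
The string in the question that returned "page not found" looks like a form-encoded POST body rather than part of a URL path, which is why the spider above sends the same fields as formdata to ajax.php. A minimal sketch using only the standard library, assuming the field names are exactly those in the question (`fields` and `body_for_page` are hypothetical names introduced here):

```python
from urllib.parse import parse_qsl, urlencode

# The text the question appended to the category URL; it is really a POST body.
raw = ('sorter=recent&location=&loop=main+loop&action=sort&view=grid'
       '&columns=3&paginated=2&currentquery%5Bcategory_name%5D=reviews')

# Decode it into the key/value pairs a form request needs.
fields = dict(parse_qsl(raw, keep_blank_values=True))


def body_for_page(page):
    """Return the form-encoded body for a given page number (hypothetical helper)."""
    data = dict(fields, paginated=str(page))
    return urlencode(data)
```

Decoding the string this way shows that `loop` is `main loop` and the last key is `currentquery[category_name]`, matching the formdata dictionary in the spider.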

I have tested this in the Scrapy shell and it works.

In the Scrapy shell:

In [0]: response.url
Out[0]: 'http://www.pcguia.pt/category/reviews/#paginated=1'

In [1]: from scrapy.http import FormRequest

In [2]: from scrapy.selector import Selector

In [3]: import json

In [4]: response.xpath('//h2/a/text()').extract()
Out[4]: 
        [u'HP Slate 8 Plus',
         u'Astro A40 +MixAmp Pro',
         u'Asus ROG G751J',
         u'BQ Aquaris E5 HD 4G',
         u'Asus GeForce GTX980 Strix',
         u'AlienTech BattleBox Edition',
         u'Toshiba Encore Mini WT7-C',
         u'Samsung Galaxy Note 4',
         u'Asus N551JK',
         u'Western Digital My Passport Wireless',
         u'Nokia Lumia 735',
         u'Photoshop Elements 13',
         u'AMD Radeon R9 285',
         u'Asus GeForce GTX970 Stryx',
         u'TP-Link AC750 Wifi Repeater']

In [5]: url = "http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php"

In [6]: formdata = {
        'sorter':'recent',
        'location':'main loop',
        'loop':'main loop',
        'action':'sort',
        'view':'grid',
        'columns':'3',
        'paginated':'2',
        'currentquery[category_name]':'reviews'
        }

In [7]: r = FormRequest(url=url, formdata=formdata)

In [8]: fetch(r)
        2015-05-12 18:29:16+0530 [default] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: None)
        [s] Available Scrapy objects:
        [s]   crawler    <scrapy.crawler.Crawler object at 0x7fcc247c4590>
        [s]   item       {}
        [s]   r          <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
        [s]   request    <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
        [s]   response   <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
        [s]   settings   <scrapy.settings.Settings object at 0x7fcc2a74f450>
        [s]   spider     <Spider 'default' at 0x7fcc239ba990>
        [s] Useful shortcuts:
        [s]   shelp()           Shell help (print this help)
        [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
        [s]   view(response)    View response in a browser

In [9]: json_data = json.loads(response.body)

In [10]: sell = Selector(text=json_data.get('content', ''))

In [11]: sell.xpath('//h2/a/text()').extract()
Out[11]: 
        [u'Asus ROG GR8',
         u'Devolo dLAN 1200+',
         u'Yezz Billy 4,7',
         u'Sony Alpha QX1',
         u'Toshiba Encore2 WT10',
         u'BQ Aquaris E5 FullHD',
         u'Toshiba Canvio AeroMobile',
         u'Samsung Galaxy Tab S 10.5',
         u'Modecom FreeTab 7001 HD',
         u'Steganos Online Shield VPN',
         u'AOC G2460PG G-Sync',
         u'AMD Radeon R7 SSD',
         u'Nvidia Shield',
         u'Asus ROG PG278Q GSync',
         u'NOX Krom Kombat']
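
The shell session above relies on the POST response being JSON whose `content` key holds an HTML fragment. The same two steps (decode the JSON, then parse the fragment) can be sketched without Scrapy. This is only an illustration: the sample payload is made up, and `HeadingLinkText` and `titles_from_ajax` are hypothetical names, not part of Scrapy.

```python
import json
from html.parser import HTMLParser


class HeadingLinkText(HTMLParser):
    """Collect the text of every <a> nested inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.in_h2 = True
        elif tag == 'a' and self.in_h2:
            self.in_link = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def titles_from_ajax(body):
    """Decode the JSON envelope and pull titles out of its 'content' HTML."""
    payload = json.loads(body)
    parser = HeadingLinkText()
    parser.feed(payload.get('content', ''))
    parser.close()
    return [title.strip() for title in parser.titles]


# Made-up payload shaped like the response described above.
sample_body = json.dumps({
    'content': '<div class="panel-wrapper">'
               '<h2><a href="/r/1">Asus ROG GR8</a></h2>'
               '<h2><a href="/r/2">Devolo dLAN 1200+</a></h2>'
               '</div>'
})
titles = titles_from_ajax(sample_body)
```

In the spider itself, `Selector(text=json_data.get('content', ''))` does the parsing step, so a hand-written parser like this is only useful for checking the response shape outside Scrapy.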

EDIT

import scrapy
import json
from scrapy.http import FormRequest, Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem
from dateutil import parser
import re

class PcguiaSpider(scrapy.Spider):
    name = "pcguia" #spider name to call in terminal
    allowed_domains = ['pcguia.pt'] #the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1'] #url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):

        sel = Selector(response)

        # from the second page on, the response is JSON with the HTML under 'content'
        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # follow every review link found on the listing page
        review_links = sel.xpath('//h2/a/@href').extract()
        for link in review_links:
            yield Request(url=link, callback=self.parse_review)

        #pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr +=1
            formdata = {
                        'sorter':'recent',
                        'location':'main loop',
                        'loop':'main loop',
                        'action':'sort',
                        'view':'grid',
                        'columns':'3',
                        'paginated':str(self.page_incr),
                        'currentquery[category_name]':'reviews'
                        }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

    def parse_review(self, response):
        # Portuguese month names, mapped to English so dateutil can parse them
        month_matcher = 'novembro|janeiro|agosto|mar\xe7o|fevereiro|junho|dezembro|julho|abril|maio|outubro|setembro'
        month_dict = {u'abril': u'April',
                      u'agosto': u'August',
                      u'dezembro': u'December',
                      u'fevereiro': u'February',
                      u'janeiro': u'January',
                      u'julho': u'July',
                      u'junho': u'June',
                      u'maio': u'May',
                      u'mar\xe7o': u'March',
                      u'novembro': u'November',
                      u'outubro': u'October',
                      u'setembro': u'September'}

        review_date = response.xpath('//span[@class="date"]/text()').extract()
        review_date = review_date[0].strip().strip('Publicado a').lower() if review_date else ''
        month = re.findall('%s'% month_matcher, review_date)[0]
        _date = parser.parse(review_date.replace(month, month_dict.get(month))).strftime('%Y-%m-%dT%H:%M:%S')

        title = response.xpath('//h1[@itemprop="itemReviewed"]/text()').extract()
        title = title[0].strip() if title else ''

        item_pub = ReviewItem(
            date=_date,
            title=title)
        yield item_pub

Output:

{'date': '2014-11-05T00:00:00', 'title': u'Samsung Galaxy Tab S 10.5'}
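
One caveat about the date handling above: `str.strip` with an argument removes a set of characters from both ends, not a literal prefix, so it can also eat letters at the end of the string. A possible alternative is to pick the day, month, and year out with a regular expression. This is a sketch only: the sample input format is an assumption (the thread does not show the raw date text), and `PT_MONTHS`, `DATE_RE`, and `parse_pt_date` are hypothetical names.

```python
import re
from datetime import datetime

# Portuguese month names mapped to month numbers.
PT_MONTHS = {
    'janeiro': 1, 'fevereiro': 2, 'março': 3, 'abril': 4,
    'maio': 5, 'junho': 6, 'julho': 7, 'agosto': 8,
    'setembro': 9, 'outubro': 10, 'novembro': 11, 'dezembro': 12,
}

# day, any non-digit filler, month name, any non-digit filler, four-digit year
DATE_RE = re.compile(r'(\d{1,2})\D+?(%s)\D+?(\d{4})' % '|'.join(PT_MONTHS))


def parse_pt_date(text):
    """Return an ISO-like timestamp string, or None if no date is found."""
    match = DATE_RE.search(text.lower())
    if match is None:
        return None
    day, month, year = match.groups()
    parsed = datetime(int(year), PT_MONTHS[month], int(day))
    return parsed.strftime('%Y-%m-%dT%H:%M:%S')
```

Because the pattern ignores whatever text surrounds the date, the "Publicado a" label does not need to be removed first, and a page without a recognizable date yields `None` instead of raising an `IndexError`.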

Answer 2

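The proper solution would be to use Selenium. The problem you are running into is that the updated page source is not reaching your Scrapy spider.

Selenium will help you click through the follow-up links and pass the updated source to response.xpath.

If you share the Scrapy code you are using, I can help you further.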

Scrapy follow pagination AJAX request - POST
