Handling POST requests to load more articles with Scrapy in Python

Date: 2014-08-30 14:26:38

Tags: python web-scraping http-post scrapy

I am trying to scrape a website with Scrapy. My spider looks like this:

class mySpider(CrawlSpider):
    name = "mytest"
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']),
             callback='parse_post', follow=True)
    ]

    def parse_post(self, response):
        item = PostItem()

        item['url'] = response.url

        item['title'] = response.xpath('//title/text()').extract()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()

        return item

Everything works, but it only scrapes the links on the main page. The site loads additional articles through a POST request, triggered by clicking a "Load more articles" button. Is there any way I can simulate that button to load the articles and keep the spider crawling?

1 answer:

Answer 0: (score: 2)

The "Load more articles" button is managed by JavaScript; clicking it fires an AJAX POST request.

In other words, this is not something Scrapy can handle easily out of the box.

However, if Scrapy is not a hard requirement, here is a solution using requests and BeautifulSoup:

from bs4 import BeautifulSoup
import requests


url = "http://www.ijreview.com/wp-admin/admin-ajax.php"
session = requests.Session()
page_size = 24

params = {
    'action': 'load_more',
    'numPosts': page_size,
    'category': '',
    'orderby': 'date',
    'time': ''
}

offset = 0
limit = 100
while offset < limit:
    params['offset'] = offset
    response = session.post(url, data=params)
    links = [a['href'] for a in BeautifulSoup(response.content).select('li > a')]
    for link in links:
        response = session.get(link)
        page = BeautifulSoup(response.content)
        title = page.find('title').text.strip()
        author = page.find('span', class_='author').text.strip()
        print {'link': link, 'title': title, 'author': author}

    offset += page_size

Prints:

{'author': u'Kevin Boyd', 'link': 'http://www.ijreview.com/2014/08/172770-president-obama-realizes-world-messy-place-thanks-social-media/', 'title': u'President Obama Calls The World A Messy Place & Blames Social Media for Making People Take Notice'}
{'author': u'Reid Mene', 'link': 'http://www.ijreview.com/2014/08/172405-17-politicians-weird-jobs-time-office/', 'title': u'12 Most Unusual Professions of Politicians Before They Were Elected to Higher Office'}
{'author': u'Michael Hausam', 'link': 'http://www.ijreview.com/2014/08/172653-video-duty-mp-fakes-surrender-shoots-hostage-taker/', 'title': u'Video: Off-Duty MP Fake Surrenders at Gas Station Before Revealing Deadly Surprise for Hostage Taker'}
...

You may need to tweak the code so that it supports different categories, sort orders, etc. You can also speed up the HTML parsing by letting BeautifulSoup use the lxml parser: instead of BeautifulSoup(response.content), use BeautifulSoup(response.content, "lxml"), but you will need to install lxml first.
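To illustrate the parser switch, here is a minimal sketch (using a made-up HTML snippet, not the real site): both parsers expose the same BeautifulSoup API, so the only change is the second argument to the constructor.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Example</title></head><body><span class='author'>Jane Doe</span></body></html>"

# Built-in parser: no extra dependency, but generally slower on large documents.
soup_builtin = BeautifulSoup(html, "html.parser")

# lxml parser: usually the fastest option; requires `pip install lxml`.
soup_lxml = BeautifulSoup(html, "lxml")

# Both parsers extract the same data through the same API.
print(soup_builtin.title.text)                          # Example
print(soup_lxml.find("span", class_="author").text)     # Jane Doe
```

Apart from speed, lxml is also more tolerant of malformed markup than the built-in parser, which can matter on real-world pages.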


And here is how you could adapt the solution to Scrapy:

import urllib
from scrapy import Item, Field, Request, Spider

class mySpider(Spider):
    name = "mytest"
    allowed_domains = ['www.ijreview.com']

    def start_requests(self):
        page_size = 25
        headers = {'User-Agent': 'Scrapy spider',
                   'X-Requested-With': 'XMLHttpRequest',
                   'Host': 'www.ijreview.com',
                   'Origin': 'http://www.ijreview.com',
                   'Accept': '*/*',
                   'Referer': 'http://www.ijreview.com/',
                   'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
        for offset in range(0, 200, page_size):
            yield Request('http://www.ijreview.com/wp-admin/admin-ajax.php',
                          method='POST',
                          headers=headers,
                          body=urllib.urlencode(
                              {'action': 'load_more',
                               'numPosts': page_size,
                               'offset': offset,
                               'category': '',
                               'orderby': 'date',
                               'time': ''}))

    def parse(self, response):
        for link in response.xpath('//ul/li/a/@href').extract():
            yield Request(link, callback=self.parse_post)

    def parse_post(self, response):
        item = PostItem()

        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0].strip()
        item['authors'] = response.xpath('//span[@class="author"]/text()').extract()[0].strip()

        return item

Outputs:

{'authors': u'Kyle Becker',
 'title': u'17 Reactions to the \u2018We Don\u2019t Have a Strategy\u2019 Gaffe That May Haunt the Rest of Obama\u2019s Presidency',
 'url': 'http://www.ijreview.com/2014/08/172569-25-reactions-obamas-dont-strategy-gaffe-may-haunt-rest-presidency/'}

...