Using Scrapy

Date: 2016-08-31 04:44:45

Tags: ajax python-2.7 scrapy

I am new to Scrapy and Python, and I am using Scrapy to scrape data.

The website uses AJAX for pagination, so I cannot get more than 10 records. Here is my code:

from scrapy import Spider
from scrapy.selector import Selector
from scrapy import Request
from justdial.items import JustdialItem
import csv
from itertools import izip
import scrapy
import re

class JustdialSpider(Spider):
    name = "JustdialSpider"
    allowed_domains = ["justdial.com"]
    start_urls = [
        "http://www.justdial.com/Mumbai/Dentists/ct-385543",
    ]

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]')
        for question in questions:
            item = JustdialItem()
            item['name'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()').extract()
            item['contact'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/b/text()').extract()
            with open('some.csv', 'wb') as f:
                writer = csv.writer(f)
                writer.writerows(izip(item['name'], item['contact']))
        return item

    # Running the code above, I am able to get the 10 records on the page.

    # The code below, which tries to handle the AJAX pagination to get more than 10 records, does not work.
    url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate='
    next_page = int(re.findall(r'page=(\d+)', url)[0]) + 1
    next_url = re.sub(r'page=\d+', 'page={}'.format(next_page), url)
    print next_url

    def parse_ajaxurl(self, response):
        # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
        my_headers = {'Referer': response.url}
        yield Request("ajax_request_url",
                      headers=my_headers,
                      callback=self.parse_ajax)

Please help me.

Thanks.

1 Answer:

Answer 0 (score: 1)

Actually, if you disable JavaScript while viewing the page, you will notice that the website offers traditional pagination instead of the "never ending" AJAX one.

Using this, you simply need to find the URL of the next page and continue:

def parse(self, response):
    questions = response.xpath('//div[contains(@class,"store-details")]')
    for question in questions:
        item = dict()
        item['name'] = question.xpath("h4/span/a/text()").extract_first()
        item['contact'] = question.xpath("p[@class='contact-info']//b/text()").extract_first()
        yield item
    # next page
    next_page = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_page:
        yield Request(next_page)
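Since the yielded Request specifies no callback, Scrapy falls back to the spider's default parse method, so every result page is handled by the same code, and the crawl stops by itself once no a[rel='next'] link is found.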

I also fixed your XPaths, but overall the only thing that changed is the 3 lines under the # next page comment. As a side note, I noticed you are saving to CSV inside the spider; instead, you can use Scrapy's built-in CSV exporter from the command line:

scrapy crawl myspider --output results.csv
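If you prefer to configure the export once per project rather than passing the flag on every run, the same thing can be done in settings.py. A minimal sketch, assuming the Scrapy 1.x feed-export settings (FEED_FORMAT, FEED_URI, FEED_EXPORT_FIELDS) and an example output path:

# settings.py -- equivalent of "scrapy crawl myspider --output results.csv"
FEED_FORMAT = 'csv'                       # serialize scraped items as CSV
FEED_URI = 'results.csv'                  # example output path
FEED_EXPORT_FIELDS = ['name', 'contact']  # column order for the exported fields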