Extracting XHR requests with Scrapy?

Date: 2014-11-18 15:54:30

Tags: xmlhttprequest web-scraping scrapy

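I'm trying to scrape social "like" counts that are generated with JavaScript. If I hard-code the XHR URL, I can scrape the data I need. But the site I'm scraping builds these XMLHttpRequests dynamically, with query string parameters that I don't know how to extract.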

For example, you can see that the request URL is constructed from the m, p, i and g parameters, which are unique to each page.

[Image: Query String Parameters]

Here is the assembled URL:

http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829

...which returns this JSON:

{"twitter":13325,"facebook":23481,"googleplusone":964,"disqus":272}

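Using the following script, I am able to extract the data I need (in this case, the Twitter count) from the request URL just mentioned, but only for that particular page.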

import json

import scrapy
from scrapy.http.request import Request

from aeon.items import AeonItem


class AeonSpider(scrapy.Spider):
    name = "aeon"
    allowed_domains = ["aeon.co"]
    start_urls = [
        "http://aeon.co/magazine/technology"
    ]

    def parse(self, response):
        # One item per article teaser on the listing page.
        for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
            item = AeonItem()
            item['title'] = sel.xpath('./a/p[1]/text()').extract()
            item['primary_url'] = sel.xpath('./a/@href').extract()
            item['word_count'] = sel.xpath('./a/div/span[2]/text()').extract()

            for each in item['primary_url']:
                # Hard-coded XHR URL: only valid for this one article.
                yield Request(
                    'http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829',
                    callback=self.parse_XHR_data,
                    meta={'item': item},
                )

    def parse_XHR_data(self, response):
        # The endpoint returns JSON such as {"twitter": 13325, ...}.
        # body_as_unicode() is deprecated in newer Scrapy releases
        # in favour of response.text.
        jsonresponse = json.loads(response.body_as_unicode())
        item = response.meta['item']
        item["tw_count"] = jsonresponse["twitter"]
        yield item

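So my question is: how do I extract the m, p, i and g URL query parameters so that I can build the request URL dynamically (rather than hard-coding it as shown above)?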

1 answer:

Answer 0 (score: 2)

Here is how you can extract the parameters from your URL:

# Python 3; in Python 2 these functions live in the urlparse module.
from urllib.parse import urlparse, parse_qs

url = 'http://aeon.co/magazine/social/social.php?url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/&m=1385983411&p=1412056831&i=25829&g=http://aeon.co/magazine/?p=25829'

# parse_qs maps each parameter name to a list of its values.
parsed_url = parse_qs(urlparse(url).query)

for p in parsed_url:
    print(p + '=' + parsed_url[p][0])

Output:

>> python test.py
url=http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/
m=1385983411
p=1412056831
i=25829
g=http://aeon.co/magazine/?p=25829
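
Going the other way, once the per-page values are known, the request URL can be assembled instead of hard-coded. The following is only a minimal sketch: `build_social_url` and `SOCIAL_ENDPOINT` are hypothetical names, the parameter values are the ones from the example URL, and it assumes the server accepts percent-encoded values (which `urlencode` produces) the same way as the unencoded form shown in the question.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint taken from the example URL in the question.
SOCIAL_ENDPOINT = "http://aeon.co/magazine/social/social.php"


def build_social_url(page_url, m, p, i, g):
    """Assemble the social-count URL from per-page parameters (hypothetical helper)."""
    params = {"url": page_url, "m": m, "p": p, "i": i, "g": g}
    # urlencode percent-encodes the values, e.g. ':' becomes '%3A'.
    return SOCIAL_ENDPOINT + "?" + urlencode(params)


social_url = build_social_url(
    "http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/",
    m="1385983411",
    p="1412056831",
    i="25829",
    g="http://aeon.co/magazine/?p=25829",
)

# Round trip: parsing the built URL gives back the original values.
round_trip = parse_qs(urlparse(social_url).query)
print(round_trip["g"][0])
```

In the spider, the hard-coded string passed to Request could then be replaced by a call to such a helper, once m, p, i and g have been located for each article. Where those values appear in the page markup is not shown in the question, so that step is left open here.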