Scrapy returns 403 when fetching data with a POST request

Date: 2014-12-30 03:35:50

Tags: ajax post web-scraping scrapy scrapy-spider

I used the Chrome developer tools (F12) and Postman to inspect the requests and their details on the site. First I log in at

http://www.zhihu.com/

(email: jianguo.bai@hirebigdata.cn, password: wsc111111), then go to

http://www.zhihu.com/people/hynuza/columns/followed

I want to get all the columns that Hynuza follows, currently 105 of them. When the page first opens, only 20 are shown, and I have to scroll down to load more. Each time I scroll down, the request details look like this:

Remote Address:60.28.215.70:80
Request URL:http://www.zhihu.com/node/ProfileFollowedColumnsListV2
Request Method:POST
Status Code:200 OK
Request Headersview source
Accept:*/*
Accept-Encoding:gzip,deflate
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
Connection:keep-alive
Content-Length:157
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:_xsrf=f1460d2580fbf34ccd508eb4489f1097; q_c1=867d4a58013241b7b5f15b09bbe7dc79|1419217763000|1413335199000; c_c=2a45b1cc8f3311e4bc0e52540a3121f7; q_c0="MTE2NmYwYWFlNmRmY2NmM2Q4OWFkNmUwNjU4MDQ1OTN8WXdNUkVxRDVCMVJaODNpOQ==|1419906156|cb0859ab55258de9ea95332f5ac02717fcf224ea"; __utma=51854390.1575195116.1419486667.1419902703.1419905647.11; __utmb=51854390.7.10.1419905647; __utmc=51854390; __utmz=51854390.1419905647.11.9.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/hynuza/columns/followed; __utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1
Host:www.zhihu.com
Origin:http://www.zhihu.com
Referer:http://www.zhihu.com/people/hynuza/columns/followed
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36
X-Requested-With:XMLHttpRequest
Form Dataview sourceview URL encoded
method:next
params:{"offset":20,"limit":20,"hash_id":"18c79c6cc76ce8db8518367b46353a54"}
_xsrf:f1460d2580fbf34ccd508eb4489f1097
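Note how this body is put together: the `params` form field is itself a JSON string, and the three fields are then form-urlencoded as a whole. A minimal sketch of rebuilding the same body with only the Python 3 standard library (the `_xsrf` and `hash_id` values are copied from the capture above and are session-specific):

```python
import json
from urllib.parse import urlencode

# "params" must be a JSON string, so serialize it with json.dumps
# before form-urlencoding the whole body. Subsequent pages just bump
# the offset: 20, 40, 60, ... until all 105 columns are returned.
params = {"offset": 20, "limit": 20,
          "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
body = urlencode({
    "method": "next",
    "params": json.dumps(params, separators=(",", ":")),
    "_xsrf": "f1460d2580fbf34ccd508eb4489f1097",
})
print(body)
```

Passing the `params` dict straight to `urlencode` would instead send Python's `repr` of the dict, which the server would not parse as JSON.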

Then I used Postman to simulate the request, like this:

(screenshot of the Postman request and its response omitted)

As you can see, it returned exactly what I wanted, even after I had logged out of the site.

Based on all of this, I wrote my spider like this:

# -*- coding: utf-8 -*-
import scrapy
import urllib
from scrapy.http import Request


class PostSpider(scrapy.Spider):
    name = "post"
    allowed_domains = ["zhihu.com"]
    start_urls = (
        'http://www.zhihu.com',
    )

    def __init__(self):
        super(PostSpider, self).__init__()

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'jianguo.bai@hirebigdata.cn', 'password': 'wsc111111'},
            callback=self.login,
        )

    def login(self, response):
        yield Request("http://www.zhihu.com/people/hynuza/columns/followed",
                      callback=self.parse_followed_columns)

    def parse_followed_columns(self, response):
        # here deal with the first 20 divs
        params = {"offset": "20", "limit": "20", "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
        method = 'next'
        _xsrf = 'f1460d2580fbf34ccd508eb4489f1097'
        data = {
            'params': params,
            'method': method,
            '_xsrf': _xsrf,
        }
        r = Request(
            "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
            method='POST',
            body=urllib.urlencode(data),
            headers={
                'Accept': '*/*',
                'Accept-Encoding': 'gzip,deflate',
                'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Cache-Control': 'no-cache',
                'Cookie': '_xsrf=f1460d2580fbf34ccd508eb4489f1097; '
                          'c_c=2a45b1cc8f3311e4bc0e52540a3121f7; '
                          '__utmt=1; '
                          '__utma=51854390.1575195116.1419486667.1419855627.1419902703.10; '
                          '__utmb=51854390.2.10.1419902703; '
                          '__utmc=51854390; '
                          '__utmz=51854390.1419855627.9.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/;'
                          '__utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1;',
                'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36',
                'host': 'www.zhihu.com',
                'Origin': 'http://www.zhihu.com',
                'Connection': 'keep-alive',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.parse_more)
        r.headers['Cookie'] += response.request.headers['Cookie']
        print r.headers
        yield r
        print "after"

    def parse_more(self, response):
        # here is where I want to get the returned divs
        print response.url
        followers = response.xpath("//div[@class='zm-profile-card "
                                   "zm-profile-section-item zg-clear no-hovercard']")
        print len(followers)

Then I got a 403, like this:

2014-12-30 10:34:18+0800 [post] DEBUG: Crawled (403) <POST http://www.zhihu.com/node/ProfileFollowedColumnsListV2> (referer: http://www.zhihu.com/people/hynuza/columns/followed)
2014-12-30 10:34:18+0800 [post] DEBUG: Ignoring response <403 http://www.zhihu.com/node/ProfileFollowedColumnsListV2>: HTTP status code is not handled or not allowed

So it never reaches parse_more.

I've been working on this for two days and still have nothing. Any help or suggestions would be appreciated.

1 Answer:

Answer 0 (score: 0)

The login sequence is correct, but the parse_followed_columns() method completely breaks the session.

You cannot use hard-coded values for data['_xsrf'] and params['hash_id'].

You should find a way to read this information directly from the HTML of the previous page and inject the values dynamically.
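One way to do that is with a couple of small extraction helpers. This is only a sketch: the HTML fragments below are hypothetical stand-ins for `response.body`, and the exact attribute names on Zhihu's pages at the time may have differed.

```python
import re

# Hypothetical HTML fragments standing in for the login page and the
# profile page; the real markup on zhihu.com may differ.
login_html = ('<input type="hidden" name="_xsrf" '
              'value="f1460d2580fbf34ccd508eb4489f1097"/>')
profile_html = ('<div class="zh-general-list" data-init=\''
                '{"params": {"hash_id": '
                '"18c79c6cc76ce8db8518367b46353a54"}}\'></div>')

def extract_xsrf(html):
    """Read the _xsrf token from the hidden form input."""
    m = re.search(r'name="_xsrf"\s+value="([^"]+)"', html)
    return m.group(1) if m else None

def extract_hash_id(html):
    """Read hash_id out of the JSON blob embedded in the page."""
    m = re.search(r'"hash_id":\s*"([^"]+)"', html)
    return m.group(1) if m else None

print(extract_xsrf(login_html))
print(extract_hash_id(profile_html))
```

The values returned by these helpers would then be injected into the form data of the POST request instead of the hard-coded strings.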

Also, I suggest you remove the headers argument from this request; it will only cause trouble.