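I used F12 (Chrome) and Postman to inspect the requests the site makes and their details
(email: jianguo.bai@hirebigdata.cn, password: wsc111111), and then went to the followed-columns page.
I want to get all of the columns that hynuza follows, currently 105. When the page first opens, only 20 of them are shown, and I have to scroll down to load more. Each time I scroll down, the request details look like this: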
Remote Address:60.28.215.70:80
Request URL:http://www.zhihu.com/node/ProfileFollowedColumnsListV2
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4
Connection:keep-alive
Content-Length:157
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:_xsrf=f1460d2580fbf34ccd508eb4489f1097; q_c1=867d4a58013241b7b5f15b09bbe7dc79|1419217763000|1413335199000; c_c=2a45b1cc8f3311e4bc0e52540a3121f7; q_c0="MTE2NmYwYWFlNmRmY2NmM2Q4OWFkNmUwNjU4MDQ1OTN8WXdNUkVxRDVCMVJaODNpOQ==|1419906156|cb0859ab55258de9ea95332f5ac02717fcf224ea"; __utma=51854390.1575195116.1419486667.1419902703.1419905647.11; __utmb=51854390.7.10.1419905647; __utmc=51854390; __utmz=51854390.1419905647.11.9.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/hynuza/columns/followed; __utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1
Host:www.zhihu.com
Origin:http://www.zhihu.com
Referer:http://www.zhihu.com/people/hynuza/columns/followed
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36
X-Requested-With:XMLHttpRequest
Form Data
method:next
params:{"offset":20,"limit":20,"hash_id":"18c79c6cc76ce8db8518367b46353a54"}
_xsrf:f1460d2580fbf34ccd508eb4489f1097
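For reference, the form body above can be rebuilt in plain Python. This is only a sketch using the standard library (Python 3), and the helper name `build_form_body` is made up for illustration. It shows that `params` is sent as a JSON string inside an ordinary URL-encoded form rather than as a nested structure.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_form_body(offset, limit, hash_id, xsrf):
    """Rebuild the URL-encoded body of the captured POST request."""
    # "params" is serialized to compact JSON first, then URL-encoded
    # together with the other two form fields.
    params = json.dumps(
        {"offset": offset, "limit": limit, "hash_id": hash_id},
        separators=(",", ":"),
    )
    return urlencode({"method": "next", "params": params, "_xsrf": xsrf})


body = build_form_body(
    20,
    20,
    "18c79c6cc76ce8db8518367b46353a54",
    "f1460d2580fbf34ccd508eb4489f1097",
)

# Decode the body again to confirm that "params" survives as valid JSON.
decoded = parse_qs(body)
params = json.loads(decoded["params"][0])
print(len(body), params["offset"], params["limit"])
```

With the values from the capture, the encoded body comes to 157 bytes, which agrees with the `Content-Length:157` header shown above.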
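Then I used Postman to simulate the same request:
As you can see, it returns what I want, and I am also registered on this site.
Based on all of this, I wrote my spider like this: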
# -*- coding: utf-8 -*-
import scrapy
import urllib
from scrapy.http import Request


class PostSpider(scrapy.Spider):
    name = "post"
    allowed_domains = ["zhihu.com"]
    start_urls = (
        'http://www.zhihu.com',
    )

    def __init__(self):
        super(PostSpider, self).__init__()

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'jianguo.bai@hirebigdata.cn', 'password': 'wsc111111'},
            callback=self.login,
        )

    def login(self, response):
        yield Request("http://www.zhihu.com/people/hynuza/columns/followed",
                      callback=self.parse_followed_columns)

    def parse_followed_columns(self, response):
        # here deal with the first 20 divs
        params = {"offset": "20", "limit": "20", "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
        method = 'next'
        _xsrf = 'f1460d2580fbf34ccd508eb4489f1097'
        data = {
            'params': params,
            'method': method,
            '_xsrf': _xsrf,
        }
        r = Request(
            "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
            method='POST',
            body=urllib.urlencode(data),
            headers={
                'Accept': '*/*',
                'Accept-Encoding': 'gzip,deflate',
                'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Cache-Control': 'no-cache',
                'Cookie': '_xsrf=f1460d2580fbf34ccd508eb4489f1097; '
                          'c_c=2a45b1cc8f3311e4bc0e52540a3121f7; '
                          '__utmt=1; '
                          '__utma=51854390.1575195116.1419486667.1419855627.1419902703.10; '
                          '__utmb=51854390.2.10.1419902703; '
                          '__utmc=51854390; '
                          '__utmz=51854390.1419855627.9.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/;'
                          '__utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1;',
                'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36',
                'host': 'www.zhihu.com',
                'Origin': 'http://www.zhihu.com',
                'Connection': 'keep-alive',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.parse_more)
        r.headers['Cookie'] += response.request.headers['Cookie']
        print r.headers
        yield r
        print "after"

    def parse_more(self, response):
        # here is where I want to get the returned divs
        print response.url
        followers = response.xpath("//div[@class='zm-profile-card "
                                   "zm-profile-section-item zg-clear no-hovercard']")
        print len(followers)
Then I get a 403 like this:
2014-12-30 10:34:18+0800 [post] DEBUG: Crawled (403) <POST http://www.zhihu.com/node/ProfileFollowedColumnsListV2> (referer: http://www.zhihu.com/people/hynuza/columns/followed)
2014-12-30 10:34:18+0800 [post] DEBUG: Ignoring response <403 http://www.zhihu.com/node/ProfileFollowedColumnsListV2>: HTTP status code is not handled or not allowed
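So it never reaches parse_more.
I have been working on this for two days and still have nothing to show for it. Any help or advice would be appreciated.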
Answer 0 (score: 0)
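The login sequence is correct, but the parse_followed_columns() method completely breaks the session.
You cannot use hardcoded values for data['_xsrf'] and params['hash_id'].
You should find a way to read these values directly from the HTML content of the previous page and inject them dynamically.
Also, I suggest removing the headers parameter from this request; it will only cause trouble.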
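The advice above could be sketched roughly as follows. This is a minimal, standard-library-only illustration (Python 3), not the site's actual markup: `SAMPLE_PAGE`, the `data-init` attribute, and the helper names are all assumptions made for the example, so the regular expressions would need to be adapted to whatever the real page contains.

```python
import html
import json
import re

# Hypothetical page fragment; the real zhihu.com markup may differ.
SAMPLE_PAGE = """
<input type="hidden" name="_xsrf" value="0123456789abcdef0123456789abcdef"/>
<div id="zh-profile-follows-list"
     data-init="{&quot;params&quot;: {&quot;hash_id&quot;: &quot;fedcba9876543210fedcba9876543210&quot;}}">
</div>
"""


def extract_xsrf(page):
    """Read the _xsrf token from a hidden form input (assumed location)."""
    match = re.search(r'name="_xsrf"\s+value="([^"]+)"', page)
    if match is None:
        raise ValueError("no _xsrf token found in page")
    return match.group(1)


def extract_hash_id(page):
    """Read hash_id from JSON embedded in the page (assumed location)."""
    # Undo HTML entity escaping (&quot; etc.) before searching.
    match = re.search(r'"hash_id"\s*:\s*"([0-9a-f]+)"', html.unescape(page))
    if match is None:
        raise ValueError("no hash_id found in page")
    return match.group(1)


def build_next_page_form(page, offset, limit=20):
    """Build the form fields for the 'next' request from the current page."""
    params = {"offset": offset, "limit": limit, "hash_id": extract_hash_id(page)}
    return {
        "method": "next",
        "params": json.dumps(params),  # sent as a JSON string, not a dict
        "_xsrf": extract_xsrf(page),
    }


form = build_next_page_form(SAMPLE_PAGE, offset=20)
print(form)
```

In a Scrapy callback, the page text would come from the response, and a dictionary like `form` could be passed as `formdata` to `scrapy.FormRequest`. That leaves the body encoding and the session cookies to Scrapy instead of relying on a hand-written `Cookie` header.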