I want to use Scrapy with Python 2.7.11 to simulate a form submission with FormRequest and crawl http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018
Here is my code:

    def start_requests(self):
        posturl = 'http://www.istic.ac.cn/suoguan/essearch.ashx'
        url = 'http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018'
        journalId = re.search(r'journalId=(.*?)&', url).group(1)
        yearNum = re.search(r'&yp=(\d+)', url).group(1)
        postdata = {
            "indexname": "xw_qk",
            "search": "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum),
            "page": "0",
            "pagenum": "20",
            "sort": "",
            "type": "content",
        }
        print journalId, yearNum
        print postdata
        self.logger.info('Visit_headpage........................')
        yield FormRequest(posturl, formdata=postdata, callback=self.parse_item)

I need to POST this form data for the page to be crawled correctly, but my response comes back empty. So I used Fiddler to inspect the form data that is actually posted, and it shows that three characters are encoded incorrectly: '(', ')' and '*'.
But when I print the formdata in the Scrapy log, it is still in the correct format:

    indexname=xw_qk&search=IELEP0229%2F(F_ReqNum)*2018%2F(F_YEAR)&page=0&pagenum=20&sort=&type=content

So how can I solve this?
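
To make the encoding difference concrete, here is a minimal sketch that uses only the standard library, not Scrapy itself. The `safe="()*"` argument is an illustrative choice, not something the site is known to require, and the logged string is copied from the log output quoted above.

```python
try:
    from urllib import quote_plus, unquote_plus  # Python 2
except ImportError:
    from urllib.parse import quote_plus, unquote_plus  # Python 3

# The search value as the spider builds it.
search = "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format("IELEP0229", "2018")

# Default escaping: '(', ')' and '*' are percent-encoded along with '/'.
strict = quote_plus(search)

# Same value, but asking the encoder to leave the three characters literal.
lenient = quote_plus(search, safe="()*")

# Decode the value shown in the log to see what it actually represents.
logged = "IELEP0229%2F(F_ReqNum)*2018%2F(F_YEAR)"
decoded = unquote_plus(logged)

print(strict)
print(lenient)
print(decoded)
```

The default escaping turns the three characters into `%28`, `%29` and `%2A`, while the `safe` variant leaves them as they are. The logged string also decodes to `IELEP0229/(F_ReqNum)*2018/(F_YEAR)`, with no `F` directly after each slash, so it may be worth checking whether `/F(` in the code was meant to be `/(`.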
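Answer 0 (score: 0)
I suggest using Request(method='POST') instead of FormRequest(), because I have had a lot of trouble with FormRequest.
Try appending the parameters directly to posturl, like this:

    yield Request(url=posturl + "?search=" + "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum), method='POST')

and concatenate the other parameters in the same way.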
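
If the goal is to control exactly which characters are escaped, one option along these lines is to build the POST body by hand and pass it to a plain Request. This is a minimal sketch, assuming the server accepts a standard application/x-www-form-urlencoded body. `build_form_body` is a hypothetical helper, and `safe="()*"` is an assumption rather than something confirmed for this site.

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def build_form_body(fields, safe="()*"):
    """Join (name, value) pairs into a urlencoded body.

    Characters listed in `safe` are left unescaped; '/' is still escaped
    because it is not part of `safe`.
    """
    return "&".join(
        "{0}={1}".format(quote(name, safe=safe), quote(value, safe=safe))
        for name, value in fields
    )


# Field values taken from the question's postdata.
fields = [
    ("indexname", "xw_qk"),
    ("search", "IELEP0229/F(F_ReqNum)*2018/F(F_YEAR)"),
    ("page", "0"),
    ("pagenum", "20"),
    ("sort", ""),
    ("type", "content"),
]
body = build_form_body(fields)
print(body)
```

The resulting string could then be sent with `Request(posturl, method='POST', body=body, headers={'Content-Type': 'application/x-www-form-urlencoded'}, callback=self.parse_item)`, which keeps the parameters in the request body instead of the URL.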
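Answer 1 (score: 0)
They are sending the same content (Scrapy's FormRequest just URL-encodes the data), but I think what is happening is that you need to receive a cookie first, by landing on http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018 before you post. Try the following, which should log the keys of the JSON response:

    # -*- coding: utf-8 -*-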
    import json
    import re

    import scrapy
    from scrapy import FormRequest


    class IsticSpider(scrapy.Spider):
        name = "istic"
        allowed_domains = ["istic.ac.cn"]
        # Visit the journal page first so any cookie the site sets is stored.
        start_urls = ['http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018']

        def parse(self, response):
            posturl = 'http://www.istic.ac.cn/suoguan/essearch.ashx'
            journalId = re.search(r'journalId=(.*?)&', response.url).group(1)
            yearNum = re.search(r'&yp=(\d+)', response.url).group(1)
            postdata = {
                "indexname": "xw_qk",
                "search": "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum),
                "page": "0",
                "pagenum": "20",
                "sort": "",
                "type": "content",
            }
            yield FormRequest(posturl, formdata=postdata, callback=self.parse_item)

        def parse_item(self, response):
            # Decode the JSON body and log its top-level keys.
            data = json.loads(response.body_as_unicode())
            self.logger.debug('%s', data.keys())
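
The regular expression for `journalId` relies on another parameter following it in the URL. As an alternative that does not depend on parameter order, here is a small sketch using the standard library's query-string parser. `journal_and_year` is a hypothetical helper and is not part of the answer above.

```python
try:
    from urlparse import urlparse, parse_qs  # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qs  # Python 3


def journal_and_year(url):
    """Return (journalId, yp) from the URL's query string."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each name to a list of values; take the first one.
    return query["journalId"][0], query["yp"][0]


url = "http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018"
journal_id, year = journal_and_year(url)
print(journal_id)
print(year)
```

Inside the spider, the same helper could be called with `response.url` in place of the two `re.search` lines.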