Why does my formdata come out wrong when I POST with Scrapy?

Time: 2017-11-10 09:10:57

Tags: python scrapy http-post decode

I want to use Scrapy with Python 2.7.11 to simulate a form POST with FormRequest and scrape http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018

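To crawl the page correctly, I need to POST the following fields to the form: indexname, search, page, pagenum, sort and type.

But my response comes back empty, so I used Fiddler to inspect the form data that was actually posted. Here is my code: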

import re
from scrapy import FormRequest

# Method of the spider class: builds the search POST from the journal page URL.
def start_requests(self):
    posturl = 'http://www.istic.ac.cn/suoguan/essearch.ashx'
    url = 'http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018'
    # Pull the journal ID and the year out of the page URL.
    journalId = re.search(r'journalId=(.*?)&', url).group(1)
    yearNum = re.search(r'&yp=(\d+)', url).group(1)
    postdata = {
        "indexname": "xw_qk",
        "search": "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum),
        "page": "0",
        "pagenum": "20",
        "sort": "",
        "type": "content",
    }
    # Log the extracted values and the form data before sending the request.
    self.logger.info('journalId=%s yearNum=%s', journalId, yearNum)
    self.logger.info('postdata=%s', postdata)
    self.logger.info('Visit_headpage........................')
    yield FormRequest(posturl, formdata=postdata, callback=self.parse_item)

The Fiddler capture suggests that three characters are being encoded incorrectly: '(', ')' and '*'.
But when I print the formdata in the Scrapy log, it is still in the correct format:

indexname=xw_qk&search=IELEP0229%2F(F_ReqNum)*2018%2F(F_YEAR)&page=0&pagenum=20&sort=&type=content

So how can I fix this?
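As a side note on the extraction step in start_requests: the two regular expressions could be replaced with the standard library's query-string parser, which does not depend on parameter order or on a trailing `&`. This is only a minimal sketch, assuming Python 3 (the question uses Python 2.7, where the same functions live in the `urlparse` module). The helper name `extract_params` is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_params(url):
    """Return (journalId, yp) taken from the URL's query string."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values; take the first of each.
    return query["journalId"][0], query["yp"][0]


url = ("http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm"
       "?lan=en&journalId=IELEP0229&yp=2018")
journal_id, year_num = extract_params(url)

# Same search expression as in the question's postdata.
search = "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journal_id, year_num)
print(search)
```

A missing parameter raises a KeyError here, which is arguably easier to diagnose than the AttributeError produced by calling `.group(1)` on a failed `re.search`.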

2 Answers:

Answer 0 (score: 0)

I suggest using Request(method='POST') instead of FormRequest(), because I have had a lot of trouble with FormRequest.

Also try appending the params directly to posturl, like this:

yield Request(url=posturl + "?search=" + "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum), method='POST')

and concatenate the other parameters in the same way.
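Concatenating the remaining parameters by hand is error-prone, so here is a minimal sketch of building the whole query string with `urlencode` instead. It assumes Python 3 and reuses the field values from the question. `build_post_url` is a hypothetical helper, and nothing in this thread confirms that the server accepts these fields in the query string rather than in the request body.

```python
from urllib.parse import urlencode


def build_post_url(posturl, journal_id, year_num):
    """Append all six form fields to posturl as a query string."""
    params = [
        ("indexname", "xw_qk"),
        ("search", "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journal_id, year_num)),
        ("page", "0"),
        ("pagenum", "20"),
        ("sort", ""),
        ("type", "content"),
    ]
    # urlencode percent-encodes each value and joins the pairs with '&'.
    return posturl + "?" + urlencode(params)


post_url = build_post_url(
    "http://www.istic.ac.cn/suoguan/essearch.ashx", "IELEP0229", "2018")
print(post_url)
```

The resulting string can then be passed as the url argument of Request together with method='POST'.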

Answer 1 (score: 0)

Both requests send the same data (Scrapy's FormRequest simply URL-encodes the values). I think the real problem is that the server requires a cookie, which you receive when you first land on the journal page:

http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018

Try the following, which visits that page first and only then posts the form:

# -*- coding: utf-8 -*-
import json
import re

import scrapy
from scrapy import FormRequest


class IsticSpider(scrapy.Spider):
    name = "istic"
    allowed_domains = ["istic.ac.cn"]
    start_urls = ['http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018']

    def parse(self, response):
        # The journal page has been fetched at this point, so with Scrapy's
        # default cookie handling any cookie it set goes out with the POST.
        posturl = 'http://www.istic.ac.cn/suoguan/essearch.ashx'
        journalId = re.search(r'journalId=(.*?)&', response.url).group(1)
        yearNum = re.search(r'&yp=(\d+)', response.url).group(1)
        postdata = {
            "indexname": "xw_qk",
            "search": "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum),
            "page": "0",
            "pagenum": "20",
            "sort": "",
            "type": "content",
        }
        yield FormRequest(posturl, formdata=postdata, callback=self.parse_item)

    def parse_item(self, response):
        # response.text is the decoded body (same as body_as_unicode()).
        data = json.loads(response.text)
        self.logger.debug('%s', data.keys())

This should log the keys of the returned JSON at debug level.
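The claim in this answer that both request bodies carry the same data can be checked with the standard library alone. The sketch below assumes Python 3 and uses the search value that the question's code builds. It encodes that value once strictly and once with '(', ')' and '*' left as literal characters, then decodes both bodies and compares the results. The variable names are illustrative only.

```python
from urllib.parse import urlencode, parse_qs

search = "IELEP0229/F(F_ReqNum)*2018/F(F_YEAR)"

# Strict encoding: '/', '(', ')' and '*' are all percent-encoded.
strict_body = urlencode({"search": search})

# Lenient encoding: leave '(', ')' and '*' as literal characters.
lenient_body = urlencode({"search": search}, safe="()*")

print(strict_body)
print(lenient_body)

# A server that decodes the form body sees the same value either way.
same = parse_qs(strict_body) == parse_qs(lenient_body)
print(same)
```

If the two decoded bodies match, the percent-encoding seen in Fiddler is unlikely to be the cause of the empty response, which is consistent with the cookie explanation above.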