Question

I want to use Scrapy with Python 2.7.11 to simulate a form POST with FormRequest and scrape http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018

Here is my code:

import re
from scrapy import FormRequest

# start_requests is a method of the spider class
def start_requests(self):
    posturl = 'http://www.istic.ac.cn/suoguan/essearch.ashx'
    url = 'http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018'
    # Pull the journal ID and the year out of the page URL
    journalId = re.search(r'journalId=(.*?)&', url).group(1)
    yearNum = re.search(r'&yp=(\d+)', url).group(1)
    postdata = {
        "indexname": "xw_qk",
        "search": "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum),
        "page": "0",
        "pagenum": "20",
        "sort": "",
        "type": "content",
    }
    print('%s %s' % (journalId, yearNum))
    print(postdata)
    self.logger.info('Visit_headpage........................')
    yield FormRequest(posturl, formdata=postdata, callback=self.parse_item)

I need to post the postdata above to the form for the page to be scraped correctly.

But my response comes back empty, so I used Fiddler to inspect the form data that was actually posted.

It suggests that three characters are being encoded incorrectly: '(', ')', and '*'.
But when I print the formdata in the Scrapy log, it is still in the correct format:

indexname=xw_qk&search=IELEP0229%2F(F_ReqNum)*2018%2F(F_YEAR)&page=0&pagenum=20&sort=&type=content

So how can I fix this?

Answer 1

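I suggest using Request(method='POST') instead of FormRequest(), because I have had a lot of trouble with FormRequest.

Also try appending the params directly to posturl, like this:

yield Request(url=posturl + "?search=" + "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum), method='POST')

and concatenate the other parameters in the same way.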
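As a rough sketch of what "concatenating the other parameters" could look like, the snippet below builds the full query string with the standard library rather than by hand. The helper name build_search_url is hypothetical, the code assumes Python 3 (the `safe` argument of urlencode is not available in Python 2.7), and whether the server accepts these fields in the query string of a POST is untested.

```python
from urllib.parse import urlencode  # Python 3 only: urlencode accepts `safe`

POST_URL = "http://www.istic.ac.cn/suoguan/essearch.ashx"


def build_search_url(journal_id, year):
    """Hypothetical helper: append every form field to the URL as a query string."""
    params = [
        ("indexname", "xw_qk"),
        ("search", "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journal_id, year)),
        ("page", "0"),
        ("pagenum", "20"),
        ("sort", ""),
        ("type", "content"),
    ]
    # safe="/()*" leaves the characters discussed in the question unescaped
    return POST_URL + "?" + urlencode(params, safe="/()*")


url = build_search_url("IELEP0229", "2018")
print(url)
```

Inside the spider, the result could then be passed along as yield Request(url=url, method='POST', callback=self.parse_item).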

Answer 2

They are sending the same content (Scrapy's FormRequest simply URL-encodes it), but I think what is happening is that you need to receive a cookie by first landing on http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018. Try the following:

# -*- coding: utf-8 -*-
import json
import re

import scrapy
from scrapy import FormRequest


class IsticSpider(scrapy.Spider):
    name = "istic"
    allowed_domains = ["istic.ac.cn"]
    # Visit the journal page first so the session cookie is set
    start_urls = ['http://www.istic.ac.cn/suoguan/QiKan_ShouYe.htm?lan=en&journalId=IELEP0229&yp=2018']

    def parse(self, response):
        posturl = 'http://www.istic.ac.cn/suoguan/essearch.ashx'
        journalId = re.search(r'journalId=(.*?)&', response.url).group(1)
        yearNum = re.search(r'&yp=(\d+)', response.url).group(1)
        postdata = {
            "indexname": "xw_qk",
            "search": "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format(journalId, yearNum),
            "page": "0",
            "pagenum": "20",
            "sort": "",
            "type": "content",
        }
        yield FormRequest(posturl, formdata=postdata, callback=self.parse_item)

    def parse_item(self, response):
        # Decode the JSON body returned by the search endpoint
        data = json.loads(response.text)
        self.logger.debug('%s', data.keys())

This should log the keys of the JSON response.
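To illustrate the claim that URL-encoding does not change the content, here is a small standard-library sketch. It assumes FormRequest encodes form fields the way urlencode does by default; the journal ID and year are the ones from the question's URL.

```python
try:
    from urllib.parse import urlencode, parse_qs  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2
    from urlparse import parse_qs

search = "{0}/F(F_ReqNum)*{1}/F(F_YEAR)".format("IELEP0229", "2018")

# Default form encoding percent-escapes '/', '(', ')' and '*'
encoded = urlencode({"search": search})

# A server that decodes the body gets the original string back
decoded = parse_qs(encoded)["search"][0]

print(encoded)
print(decoded == search)
```

The escaped form looks different from what a browser may send, but it decodes to the same value, which is why the cookie is the more likely culprit.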

Why is my formdata wrong when I post with Scrapy?
