I am trying to write a parser for m-ati.su using Scrapy. As a first step I need to get the values and the text of the combo boxes named "From" and "To" for different cities. I looked at the requests in Firebug and wrote:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class spider(BaseSpider):
    name = 'ati_su'
    start_urls = ['http://m-ati.su/Tables/Default.aspx?EntityType=Load']
    allowed_domains = ["m-ati.su"]

    def parse(self, response):
        yield FormRequest('http://m-ati.su/Services/ATIGeoService.asmx/GetGeoCompletionList',
                          callback=self.ati_from,
                          formdata={'prefixText': 'moscow', 'count': '10', 'contextKey': 'All_0$Rus'})

    def ati_from(self, response):
        json = response.body
        open('results.txt', 'wb').write(json)
我有" 500内部服务器错误"对于这个请求。我做错了什么?抱歉英文不好。 感谢
Answer 0 (score: 0):
I think you may need to add an X-Requested-With: XMLHttpRequest header to the POST request, so you could try this:
    def parse(self, response):
        yield FormRequest('http://m-ati.su/Services/ATIGeoService.asmx/GetGeoCompletionList',
                          callback=self.ati_from,
                          formdata={'prefixText': 'moscow', 'count': '10', 'contextKey': 'All_0$Rus'},
                          headers={"X-Requested-With": "XMLHttpRequest"})
Edit: I tried running the spider and came up with the following (when I checked with Firefox, the request body was JSON-encoded, so I used Request and forced the "POST" method; the response I got back was encoded in "windows-1251"):
from scrapy.spider import BaseSpider
from scrapy.http import Request
import json

class spider(BaseSpider):
    name = 'ati_su'
    start_urls = ['http://m-ati.su/Tables/Default.aspx?EntityType=Load']
    allowed_domains = ["m-ati.su"]

    def parse(self, response):
        # The service expects a JSON-encoded POST body rather than form data,
        # so build the request by hand instead of using FormRequest.
        yield Request('http://m-ati.su/Services/ATIGeoService.asmx/GetGeoCompletionList',
                      callback=self.ati_from,
                      method="POST",
                      body=json.dumps({
                          'prefixText': 'moscow',
                          'count': '10',
                          'contextKey': 'All_0$Rus'
                      }),
                      headers={
                          "X-Requested-With": "XMLHttpRequest",
                          "Accept": "application/json, text/javascript, */*; q=0.01",
                          "Content-Type": "application/json; charset=utf-8",
                          "Pragma": "no-cache",
                          "Cache-Control": "no-cache",
                      })

    def ati_from(self, response):
        # The body comes back as windows-1251 encoded JSON.
        jsondata = response.body
        print json.loads(jsondata, encoding="windows-1251")
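If the service follows the usual ASP.NET .asmx convention of wrapping its result in a top-level "d" key, the callback could be extended roughly as sketched below. This is only a sketch: the "d" key and the assumption that the completion list is a flat list of strings are guesses that should be checked against the actual response.

    def ati_from(self, response):
        # Decode the windows-1251 JSON body.
        data = json.loads(response.body, encoding="windows-1251")
        # "d" is the conventional .asmx wrapper key -- an assumption here,
        # as is the idea that it holds a plain list of suggestion strings.
        for city in data.get("d", []):
            print city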