Python 3.6, Scrapy 1.5
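I am scraping the John Deere warranty page to see all new PMPs and their expiration dates. While inspecting the network traffic between the browser and the page, I found a REST API that supplies the page with its data.
Now I am trying to fetch the JSON data from the API instead of scraping the content of the JavaScript-rendered page. However, I get an Internal Server Error and I don't know why.
I am using Scrapy to log in and capture the data.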
import scrapy


class PmpSpider(scrapy.Spider):
    name = 'pmp'
    start_urls = ['https://jdwarrantysystem.deere.com/portal/']

    def parse(self, response):
        self.log('***Form Request***')
        login = {
            'USERNAME': *******,
            'PASSWORD': *******
        }
        yield scrapy.FormRequest.from_response(
            response,
            url='https://registration.deere.com/servlet/com.deere.u90950.registrationlogin.view.servlets.SignInServlet',
            method='POST', formdata=login, callback=self.parse_pmp
        )
        self.log('***PARSE LOGIN***')

    def parse_pmp(self, response):
        self.log('***PARSE PMP***')
        cookies = response.headers.getlist('Set-Cookie')
        for cookie in cookies:
            cookie = cookie.decode('utf-8')
            self.log(cookie)
            cook = cookie.split(';')[0].split('=')[1]
            path = cookie.split(';')[1].split('=')[1]
            domain = cookie.split(';')[2].split('=')[1]
        yield scrapy.Request(
            url='https://jdwarrantysystem.deere.com/api/pip-products/collection',
            method='POST',
            cookies={
                'SESSION': cook,
                'path': path,
                'domain': domain
            },
            headers={
                "Accept": "application/json",
                "accounts": ["201445", "201264", "201167", "201342", "201341", "201221"],
                "excludedPin": "",
                "export": "",
                "language": "",
                "metric": "Y",
                "pipFilter": "OPEN",
                "pipType": ["MALF", "SAFT"]
            },
            meta={'dont_redirect': True},
            callback=self.parse_pmp_list
        )

    def parse_pmp_list(self, response):
        self.log('***LISTA PMP***')
        self.log(response.body)
Why am I getting this error? How can I get the data from this API?
2018-07-05 17:26:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 1 times): 500 Internal Server Error
2018-07-05 17:26:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 2 times): 500 Internal Server Error
2018-07-05 17:26:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 3 times): 500 Internal Server Error
2018-07-05 17:26:21 [scrapy.core.engine] DEBUG: Crawled (500) <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (referer: https://jdwarrantysystem.deere.com/portal/)
2018-07-05 17:26:21 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://jdwarrantysystem.deere.com/api/pip-products/collection>: HTTP status code is not handled or not allowed
Answer 0 (score: 0)
I found the problem: this is a POST request, so its data must be sent in the request body in JSON format because, unlike in a GET request, the parameters do not go in the URI. The request headers also need a "content-type": "application/json" entry. See: "How parameters are sent in POST request" and "Rest POST in python". So, edit the function parse_pmp:
"content-type": "application/json"
Everything works fine!
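For reference, here is a minimal sketch of how the edited request in parse_pmp might look. It assumes that the filter fields the question placed in the headers are what the API expects in the JSON body, and that only the SESSION cookie needs to be sent. The helper name build_pmp_request_kwargs is hypothetical and exists only so the example runs without Scrapy installed; the cookie value is a placeholder.

```python
import json

API_URL = 'https://jdwarrantysystem.deere.com/api/pip-products/collection'


def build_pmp_request_kwargs(session_cookie):
    """Build keyword arguments for scrapy.Request (hypothetical helper)."""
    # Filter parameters move out of the headers and into the body.
    # Assumption: the API accepts exactly the fields shown in the question.
    payload = {
        "accounts": ["201445", "201264", "201167", "201342", "201341", "201221"],
        "excludedPin": "",
        "export": "",
        "language": "",
        "metric": "Y",
        "pipFilter": "OPEN",
        "pipType": ["MALF", "SAFT"],
    }
    return {
        'url': API_URL,
        'method': 'POST',
        # Assumption: path and domain are cookie attributes, not cookies,
        # so only the session value itself is sent.
        'cookies': {'SESSION': session_cookie},
        'headers': {
            'Accept': 'application/json',
            # Tells the server that the body is JSON.
            'Content-Type': 'application/json',
        },
        # scrapy.Request takes the raw body as a string or bytes.
        'body': json.dumps(payload),
    }


if __name__ == '__main__':
    kwargs = build_pmp_request_kwargs('placeholder-session-id')
    print(kwargs['headers'])
    print(kwargs['body'])
```

Inside the spider, parse_pmp would then end with something like `yield scrapy.Request(callback=self.parse_pmp_list, **build_pmp_request_kwargs(cook))`.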