This is the page I want to scrape.
The data on the page comes from this URL.
Here is my scraper's code. I have checked the headers and formdata at least five times, and I believe they are correct. The problem is that even though I override the default behavior in my parse
method, the spider still sends a GET
request to the start_url
instead of my POST.
import json

import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = [
        'https://277kmabdt6-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%20(lite)%203.27.1%3BJS%20Helper%202.26.0%3Bvue-instantsearch%201.7.0&x-algolia-application-id=277KMABDT6&x-algolia-api-key=bf8b92303c2418c9aed3c2f29f6cbdab',
    ]
    formdata = {
        'requests': [{'indexName': 'listings',
                      'params': 'query=&hitsPerPage=24&page=0&highlightPreTag=__ais-highlight__&highlightPostTag=__%2Fais-highlight__&filters=announce_type%3Aproperty-announces%20AND%20language_code%3Apt%20AND%20listing_id%3A%205&facets=%5B%22announce_type%22%5D&tagFilters='}]
    }
    headers = {
        'accept': 'application/json',
        'content-type': 'application/x-www-form-urlencoded',
        'Origin': 'https://www.flat.com.br',
        'Referer': 'https://www.flat.com.br/search?query=',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }

    def parse(self, response):
        for url in self.start_urls:
            yield scrapy.FormRequest(
                url=url,
                method='POST',
                headers=self.headers,
                formdata=self.formdata,
                callback=self.parse_page,
            )

    def parse_page(self, response):
        print(json.loads(response.text))
This is the message I get when I run the spider.
My question is: why is a GET
request being sent to the URL? Am I missing something? Could there be some other reason my request is failing?
2019-07-01 11:45:58 [scrapy] DEBUG: Crawled (400) <GET https://277kmabdt6-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%20(lite)%203.27.1%3BJS%20Helper%202.26.0%3Bvue-instantsearch%201.7.0&x-algolia-application-id=277KMABDT6&x-algolia-api-key=bf8b92303c2418c9aed3c2f29f6cbdab> (referer: None)
2019-07-01 11:45:58 [scrapy] DEBUG: Ignoring response <400 https://277kmabdt6-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%20(lite)%203.27.1%3BJS%20Helper%202.26.0%3Bvue-instantsearch%201.7.0&x-algolia-application-id=277KMABDT6&x-algolia-api-key=bf8b92303c2418c9aed3c2f29f6cbdab>: HTTP status code is not handled or not allowed
Answer 0 (score: 2)
You need to rename your parse
method to start_requests
, because by default Scrapy sends a GET
request to every URL in self.start_urls
:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.FormRequest(
            url=url,
            method='POST',
            headers=self.headers,
            formdata=self.formdata,
            callback=self.parse_page,
        )
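For context, the default start_requests that the spider inherits behaves roughly like the sketch below. This is a simplified stand-in (the real method yields scrapy.Request objects, and plain dicts are used here so the example runs without Scrapy installed), but it shows why each start_url is fetched with GET unless you override the method:

```python
# Simplified sketch of Scrapy's default start_requests behavior.
# Plain dicts stand in for scrapy.Request objects to keep the
# example self-contained.
def default_start_requests(start_urls):
    for url in start_urls:
        # No method is specified, so every request defaults to GET.
        yield {"url": url, "method": "GET", "dont_filter": True}

requests = list(default_start_requests([
    "https://example.com/a",
    "https://example.com/b",
]))
print(requests)
```

Overriding start_requests, as the answer suggests, replaces this loop entirely, which is what lets the POST requests through.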
Answer 1 (score: 1)
I think you will only get a valid response when the payload is sent as body=json.dumps(self.formdata)
rather than formdata=self.formdata
, because the payload is JSON. The suggested portion should look like this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.FormRequest(
            url=url,
            method='POST',
            headers=self.headers,
            body=json.dumps(self.formdata),
            callback=self.parse_page,
        )
When you use the parse()
method, by default it receives the responses from GET
requests to the URLs in start_urls
. In this case, however, the URL you put in start_urls
never reaches the parse()
method, because it returns a 400 status error (or similar). So, to use the parse()
method the way you tried to, make sure the url
you put in start_urls
returns the status you need. In other words, you can use a different url
that returns status 200, and then make the POST request to the right url
from parse() to get the response you want.
import json

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    # different url
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']
    url = 'https://277kmabdt6-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%20(lite)%203.27.1%3BJS%20Helper%202.26.0%3Bvue-instantsearch%201.7.0&x-algolia-application-id=277KMABDT6&x-algolia-api-key=bf8b92303c2418c9aed3c2f29f6cbdab'
    formdata = {
        'requests': [{'indexName': 'listings',
                      'params': 'query=&hitsPerPage=24&page=0&highlightPreTag=__ais-highlight__&highlightPostTag=__%2Fais-highlight__&filters=announce_type%3Aproperty-announces%20AND%20language_code%3Apt%20AND%20listing_id%3A%205&facets=%5B%22announce_type%22%5D&tagFilters='}]
    }
    headers = {
        'accept': 'application/json',
        'content-type': 'application/x-www-form-urlencoded',
        'Origin': 'https://www.flat.com.br',
        'Referer': 'https://www.flat.com.br/search?query=',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }

    def parse(self, response):
        yield scrapy.Request(
            url=self.url,
            method='POST',
            headers=self.headers,
            body=json.dumps(self.formdata),
            callback=self.parse_page,
        )

    def parse_page(self, response):
        print(json.loads(response.text))
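The difference between the two payload encodings can be seen with the standard library alone. The payload below is a cut-down, hypothetical version of the spider's formdata, and no Scrapy is required:

```python
import json
from urllib.parse import urlencode

# A cut-down version of the spider's payload, for illustration only.
payload = {"requests": [{"indexName": "listings", "params": "query="}]}

# formdata=... makes Scrapy urlencode the dict: the nested list is
# stringified and percent-encoded, which is not the JSON structure
# the Algolia endpoint expects.
form_body = urlencode(payload)

# body=json.dumps(...) preserves the nested structure as real JSON.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

Round-tripping json_body through json.loads recovers the original dict, while form_body cannot be parsed back into the nested structure, which is why the JSON body is the one the API accepts.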
Answer 2 (score: 1)
First, rename your parse method to:
def start_requests(self):

When you are sending a form, you should use scrapy.FormRequest. You only want to use method='POST' on a plain Request when you are sending a raw body; in this case it looks like form data, so FormRequest works here.
You can also use other tools, such as FormRequest.from_response, to help with this. If you want to send a raw JSON string or something else instead, you need to convert the dict to a string and set the method to POST, as shown above. FormRequest sends a POST request automatically, and if you use the from_response feature it will pick up the form's existing fields for you.
Reference: https://docs.scrapy.org/en/latest/topics/request-response.html#request-subclasses