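This is the page I want to scrape, and this is the AJAX request that retrieves the data.
I built the same AJAX request with the same headers and request payload. The request does not fail, but I get back a nearly empty JSON with no data in it.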
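The response to the AJAX request is a JSON document in which one key holds another JSON document as a string. Because the output is large, I think the problem may be related to the Content-Length header. When I send the Content-Length header, the request fails with 400 Bad Request, and when I leave the header out, the response contains no data.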
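For context, Content-Length is the size of the request body in bytes, so a value copied from the browser (689 in the commented-out header) is only correct if the body sent is byte-for-byte identical. HTTP clients, Scrapy included as far as I know, normally fill this header in themselves. A minimal sketch of how the value is derived, using a shortened stand-in payload and a hypothetical helper name `content_length`; whether a mismatch is what triggers the 400 on this server is an assumption:

```python
import json


def content_length(body_text):
    """Return the Content-Length for a text body: its size in UTF-8 bytes."""
    return len(body_text.encode("utf-8"))


# Shortened stand-in for the search payload; the real body has many more keys.
payload = {"SearchText": "", "PropertyFor": "ForSale", "PageNo": 1, "PageSize": 10}

# Compact separators avoid the spaces json.dumps adds by default,
# which would otherwise change the byte count.
body = json.dumps(payload, separators=(",", ":"))

print(content_length(body))

# Characters and bytes differ for non-ASCII text, another way a
# hand-written Content-Length can end up wrong.
print(len("é"), content_length("é"))
```

Leaving the header out and letting the client compute it avoids this class of mismatch entirely.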
How can I make a valid request to this URL?
import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = [
        'https://www.propertyqueen.com.my/Search/SearchPropertyMarker'
    ]
    # Headers copied from the browser's AJAX request.
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, br',
        'Host': 'www.propertyqueen.com.my',
        'Origin': 'https://www.propertyqueen.com.my',
        # 'Content-Length': 689,  # enabling this gives 400 Bad Request
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/json; charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'Referer': 'https://www.propertyqueen.com.my/for-sale?searchtext=',
        'Cookie': '_ga=GA1.3.513681266.1562266208; ASP.NET_SessionId=utadmp0lcxiobehzff5xpzyl; _gid=GA1.3.1978049576.1562853910; _gat=1',
    }
    # JSON request body, sent as a raw string.
    payload = '{"SearchTextDisplay":"","SearchText":"","PropertyName":null,"State":"","City":"","PriceMin":50,"PriceMax":1000000,"BuildUpAreaMin":50,"BuildUpAreaMax":200000,"LandAreaMin":0,"LandAreaMax":1000000000000,"CosfMin":200,"CosfMax":1200,"PropertyFor":"ForSale","ListType":"","PropertyType":"-1","Bedroom":-1,"Bathroom":-1,"Carparking":-1,"Finishing":"-1","Furnishing":null,"Tenure":"-1","PropertyAge":"-1","FloorLebel":"-1","PageNo":1,"PageSize":10,"OpenTab":"","MinLat":0,"MaxLat":0,"MinLng":0,"MaxLng":0,"SortBy":"-1","zoom":0,"like":false,"suggestionrequired":false,"latitude":0,"longitude":0,"LandTitle":null,"CompletionYear":null,"TotalLotsUnit":null,"RentType":null,"PreferredTenant":null}'

    def start_requests(self):
        # POST the payload to each start URL instead of the default GET.
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                method='POST',
                headers=self.headers,
                body=self.payload,
                callback=self.parse_items
            )

    def parse_items(self, response):
        # Dump the raw response body for inspection.
        print(response.text)
Answer 0 (score: 0)
With the spider modified slightly, this returns results for me.
from scrapy.spiders import Spider
from scrapy import Request


class MySpider(Spider):
    name = 'myspider'
    start_urls = [
        'https://www.propertyqueen.com.my/Search/SearchPropertyMarker'
    ]
    # Browser headers, without Cookie, Host, or Content-Length.
    headers = {
        'Origin': 'https://www.propertyqueen.com.my',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'Content-Type': 'application/json; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.propertyqueen.com.my/for-sale',
    }
    # JSON request body, sent as a raw string.
    payload = '{"SearchTextDisplay":"","SearchText":"","PropertyName":null,"State":"","City":"","PriceMin":50000,"PriceMax":100000000,"BuildUpAreaMin":50,"BuildUpAreaMax":200000,"LandAreaMin":0,"LandAreaMax":1000000000000,"CosfMin":200,"CosfMax":1200,"PropertyFor":"ForSale","ListType":"","PropertyType":"-1","Bedroom":-1,"Bathroom":-1,"Carparking":-1,"Finishing":"-1","Furnishing":null,"Tenure":"-1","PropertyAge":"-1","FloorLebel":"-1","PageNo":1,"PageSize":10,"OpenTab":"","MinLat":0,"MaxLat":0,"MinLng":0,"MaxLng":0,"SortBy":"-1","zoom":0,"like":false,"suggestionrequired":false,"latitude":0,"longitude":0,"LandTitle":null,"CompletionYear":null,"TotalLotsUnit":null,"RentType":null,"PreferredTenant":null}'

    def start_requests(self):
        # POST the payload to each start URL instead of the default GET.
        for url in self.start_urls:
            yield Request(
                url=url,
                method='POST',
                headers=self.headers,
                body=self.payload,
                callback=self.parse_items
            )

    def parse_items(self, response):
        # Dump the raw response body for inspection.
        print(response.text)
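The callback above only prints the raw body. Since the question describes the response as JSON in which one key holds another JSON document as a string, decoding it takes two passes. A rough sketch of that step; the key name `Data`, the helper `decode_nested`, and the sample response are all made up for illustration, because the real response layout is not shown here:

```python
import json


def decode_nested(text, key="Data"):
    """Parse the outer JSON, then parse the JSON string stored under `key`."""
    outer = json.loads(text)
    inner = outer.get(key)
    # The inner document arrives as a string, so it needs a second json.loads.
    if isinstance(inner, str) and inner:
        return json.loads(inner)
    # Missing, null, or already-decoded values are returned unchanged.
    return inner


# Made-up sample shaped like the response described in the question.
sample = json.dumps({"Data": json.dumps([{"id": 1}, {"id": 2}])})

for item in decode_nested(sample):
    print(item["id"])
```

Inside a spider, `decode_nested(response.text)` could replace the `print` call once the actual key name is known.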
I used a plain Spider instead of CrawlSpider and left the "Cookie" entry out of the headers.
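Compared with the question, the headers in this answer also lack Host and Content-Length, and the payload uses different PriceMin and PriceMax values, so any of these may contribute to the difference in results. One way to apply the header change to a set copied from the browser is to filter out the entries the HTTP client usually manages itself. The sketch below assumes that list is Cookie, Content-Length, and Host; the helper name `strip_managed_headers` is hypothetical:

```python
# Headers the HTTP client normally sets on its own (an assumed list).
MANAGED = {"cookie", "content-length", "host"}


def strip_managed_headers(headers):
    """Return a copy of `headers` without the client-managed entries."""
    # Header names are case-insensitive, so compare in lower case.
    return {k: v for k, v in headers.items() if k.lower() not in MANAGED}


# A few headers as they might be copied from the browser's network tab.
copied = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Host": "www.propertyqueen.com.my",
    "Content-Length": "689",
    "X-Requested-With": "XMLHttpRequest",
    "Cookie": "ASP.NET_SessionId=utadmp0lcxiobehzff5xpzyl",
}

cleaned = strip_managed_headers(copied)
print(sorted(cleaned))
```

A hard-coded session cookie goes stale once the server-side session expires, which is one plausible reason the original request came back nearly empty; Scrapy's cookie middleware, enabled by default, keeps track of cookies the server sets during the crawl.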