I have this URL, whose response contains some JSON data:
https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query=sadaf%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true
Every time I paste this URL into a browser with a different query, I get a nice JSON result. But in Scrapy, or in the scrapy shell, I get no results. Here is my Scrapy spider class:
import json
from os import listdir
from os.path import isfile, join

import scrapy


# The class header was not shown in the original post; the class name is assumed.
class TripadvisorSpider(scrapy.Spider):
    name = 'tripadvisor'
    allowed_domains = ['tripadvisor.com']
    link = "https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true"

    def start_requests(self):
        # Every file in results/ is a JSON document with a "hotels" list.
        files = [f for f in listdir('results/') if isfile(join('results/', f))]
        for file in files:
            with open('results/' + file, 'r', encoding="utf8") as tour_info:
                tour = json.load(tour_info)
            for hotel in tour["hotels"]:
                yield scrapy.Request(self.link.format(hotel))

    def parse(self, response):
        print(response.body)
With this code, in the scrapy shell, I get this result:
b'{"normalized":{"query":""},"query":{},"results":[],"partial_content":false}'
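When I run the spider from the Scrapy command line, I first got a "Forbidden by robots.txt" error for every URL. I changed Scrapy's ROBOTSTXT_OBEY setting to False so that it no longer obeys that file. Now I get [] for every request, but I should get a JSON object like this: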
[
  {
    "urls": [
      {
        "url_type": "hotel",
        "name": "Sadaf Hotel, Dubai, United Arab Emirates",
        "type": "HOTEL",
        "url": "\/Hotel_Review-g295424-d633008-Reviews-Sadaf_Hotel-Dubai_Emirate_of_Dubai.html"
      }
    ],
    .
    .
    .
Answer 0 (score: 0)
Try removing the session ID (the searchSessionId parameter) from the URL, and then check how "unfriendly" your settings.py is. (See also this blog.)
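As a rough sketch of that suggestion, the URL can be rebuilt from its parameters so that searchSessionId is left out and the hotel name is percent-encoded. The names `STATIC_PARAMS`, `build_typeahead_url` and `CUSTOM_SETTINGS` are made up for illustration, and the setting values are placeholders to experiment with. Whether any of this makes the endpoint return results is not something I have verified.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.tripadvisor.com/TypeAheadJson"

# Parameters copied from the URL in the question, with searchSessionId dropped.
STATIC_PARAMS = {
    "action": "API",
    "types": "geo,nbrhd,hotel,theme_park",
    "legacy_format": "true",
    "urlList": "true",
    "strictParent": "true",
    "max": "6",
    "name_depth": "3",
    "interleaved": "true",
    "scoreThreshold": "0.5",
    "strictAnd": "false",
    "typeahead1_5": "true",
    "disableMaxGroupSize": "true",
    "geoBoostFix": "true",
    "neighborhood_geos": "true",
    "details": "true",
    "link_type": "hotel,vr,eat,attr",
    "rescue": "true",
    "uiOrigin": "trip_search_Hotels",
    "source": "trip_search_Hotels",
    "startTime": "1516800919604",
    "nearPages": "true",
}


def build_typeahead_url(hotel_name):
    """Build the request URL for one hotel name (hypothetical helper)."""
    params = dict(STATIC_PARAMS)
    params["query"] = f"{hotel_name} dubai hotel"
    # quote_via=quote encodes spaces as %20, as in the original URL.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


# Per-spider overrides, usable as the spider's custom_settings attribute.
# The values are assumptions to try out, not known-good settings.
CUSTOM_SETTINGS = {
    "ROBOTSTXT_OBEY": False,  # already changed in the question
    "USER_AGENT": "Mozilla/5.0 (compatible; example-client)",  # placeholder
    "DOWNLOAD_DELAY": 1.0,  # seconds between requests
    "COOKIES_ENABLED": True,
}

print(build_typeahead_url("sadaf"))
```

In the spider, `scrapy.Request(build_typeahead_url(hotel))` would then replace `self.link.format(hotel)`, which also avoids sending unencoded spaces or `&` characters from the hotel names.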
But it may be easier to use Wget, for example (with {} replaced by the hotel name):
wget 'https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&nearPages=true' -O results.json
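Once a response has been saved (for instance as results.json), the hotel links could be pulled out roughly as below. This is only a sketch: `extract_urls` is a made-up helper, and the response shape is assumed from the two snippets in the question (a top-level list in the expected output, a dict with a "results" key in the empty response). The sample data is the fragment shown in the question.

```python
import json


def extract_urls(payload, url_type="hotel"):
    """Return (name, url) pairs for entries with the given url_type."""
    # Accept both shapes seen in the question: a bare list, or a dict
    # that wraps the list under "results".
    items = payload.get("results", []) if isinstance(payload, dict) else payload
    found = []
    for item in items:
        for entry in item.get("urls", []):
            if entry.get("url_type") == url_type:
                found.append((entry["name"], entry["url"]))
    return found


# The fragment from the question; for a saved file, use json.load(open(...)).
sample = json.loads(r'''
[{"urls": [{"url_type": "hotel",
            "name": "Sadaf Hotel, Dubai, United Arab Emirates",
            "type": "HOTEL",
            "url": "\/Hotel_Review-g295424-d633008-Reviews-Sadaf_Hotel-Dubai_Emirate_of_Dubai.html"}]}]
''')

for name, url in extract_urls(sample):
    print(name, "https://www.tripadvisor.com" + url)
```

The returned paths are relative, so they are joined with the site's host before use.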