I'm trying to use open-uri to scrape data from:
https://www.zomato.com/grande-lisboa/fu-hao-massamá
However, the site automatically redirects to:
https://www.zomato.com/pt/grande-lisboa/fu-hao-massamá
I don't want the Portuguese (/pt/) version; I want the English one. How do I tell Ruby to stop doing this?
Answer 0 (score: 3)
This is called content negotiation: the web server redirects based on your request. pt (Portuguese) seems to be the default, at least from my location:
$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=pt; ...
Location: https://www.zomato.com/pt/grande-lisboa/fu-hao-massam%C3%A1
You can request another language by sending an Accept-Language header. Here is the response for Accept-Language: es (Spanish):
$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: es"
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=es_cl; ...
Location: https://www.zomato.com/es/grande-lisboa/fu-hao-massam%C3%A1
And this is the response for Accept-Language: en (English):
$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: en"
HTTP/1.1 200 OK
Set-Cookie: zl=en; ...
This appears to be the resource you are looking for: the server answers 200 OK directly instead of redirecting.
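The curl checks above can also be sketched in Ruby. This is only an illustration, not part of the original answer: probe_language and language_prefix are hypothetical helper names, and the sketch assumes the server behaves as in the curl output shown here. Net::HTTP does not follow redirects on its own, so the 301 and its Location header stay visible.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: send a HEAD request with the given Accept-Language
# value and return [status_code, location_header]. Net::HTTP does not
# follow redirects, so a 301 is returned as-is.
def probe_language(url, language)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Head.new(uri)
    request['Accept-Language'] = language
    response = http.request(request)
    [response.code, response['Location']]
  end
end

# Hypothetical helper: return the leading two-letter path segment of a
# redirect target (e.g. "pt"), or nil when there is none.
def language_prefix(location)
  return nil if location.nil?

  first_segment = URI(location).path.split('/').reject(&:empty?).first
  first_segment if first_segment&.match?(/\A[a-z]{2}\z/)
end
```

Calling probe_language(url, 'en') and passing the returned Location value to language_prefix would then show whether the server still redirects to a localized path; a nil result means no language prefix was found.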
In Ruby you would use:
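require 'nokogiri'
require 'open-uri'

url = 'https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1'
headers = { 'Accept-Language' => 'en' }  # ask the server for the English page

# Use URI.open: Kernel#open no longer opens URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open(url, headers))
doc.at('html')[:lang]
#=> "en"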
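If you want to list fallback languages rather than a single one, the Accept-Language header accepts several values weighted with q parameters. The following is a minimal sketch under that assumption: accept_language is a hypothetical helper name, and whether this particular site honors the weights has not been verified.

```ruby
# Hypothetical helper: build an Accept-Language value from languages given
# in order of preference. The first language gets the implicit weight 1.0;
# each later one is weighted 0.1 lower, never dropping below 0.1.
def accept_language(*languages)
  languages.each_with_index.map do |lang, index|
    next lang if index.zero?

    q = [(1.0 - index * 0.1).round(1), 0.1].max
    "#{lang};q=#{q}"
  end.join(', ')
end

headers = { 'Accept-Language' => accept_language('en', 'pt') }
# headers['Accept-Language'] is now "en, pt;q=0.9"
```

The resulting hash can be passed as the second argument to URI.open, in the same way as the single-language headers hash.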