net/http automatically redirects a web page to another language

Posted: 2015-02-04 10:23:35

Tags: ruby web-scraping nokogiri net-http open-uri

I'm trying to use open-uri to scrape data from:

https://www.zomato.com/grande-lisboa/fu-hao-massamá

However, the site automatically redirects to:

https://www.zomato.com/pt/grande-lisboa/fu-hao-massamá

I don't want the Portuguese (/pt/) version. I want the English one. How do I tell Ruby to stop doing that?

1 answer:

Answer 0: (score: 3)

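This is called content negotiation: the web server chooses where to redirect based on your request. pt (Portuguese) seems to be the default, at least from my location: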

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=pt; ...
Location: https://www.zomato.com/pt/grande-lisboa/fu-hao-massam%C3%A1

You can request a different language by sending an Accept-Language header. Here is the response for Accept-Language: es (Spanish):

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: es"
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=es_cl; ...
Location: https://www.zomato.com/es/grande-lisboa/fu-hao-massam%C3%A1

And here is the response for Accept-Language: en (English):

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: en"
HTTP/1.1 200 OK
Set-Cookie: zl=en; ...

This time there is no redirect, and this seems to be the resource you were looking for.
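The same check can be reproduced from Ruby with net/http, which does not follow redirects on its own, so the status code and Location header stay visible. This is only a minimal sketch: the helper names `language_head_request` and `probe_language` are hypothetical, not part of any library, and the site's behavior may have changed since the curl output above was captured.

```ruby
require 'net/http'
require 'uri'

# Build a HEAD request (like `curl -I`) that asks for a specific language.
# Hypothetical helper; returns the parsed URI and the unsent request.
def language_head_request(url, language)
  uri = URI(url)
  request = Net::HTTP::Head.new(uri)
  request['Accept-Language'] = language
  [uri, request]
end

# Send the request without following redirects.
# Returns the status code (as a string) and the Location header (or nil).
def probe_language(url, language)
  uri, request = language_head_request(url, language)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    response = http.request(request)
    [response.code, response['Location']]
  end
end
```

Calling `probe_language(url, 'en')` performs the actual network request; comparing its result for 'en', 'es', and 'pt' should show which languages trigger a redirect.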

To fetch the English page in Ruby, you would use:

require 'nokogiri'
require 'open-uri'

url = 'https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1'
headers = {'Accept-Language' => 'en'}

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs
doc = Nokogiri::HTML(URI.open(url, headers))
doc.at('html')[:lang]
#=> "en"
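
Note that the URL in the question contains a literal "á", while the code above uses the percent-encoded form "%C3%A1". If you start from the unencoded URL, something along these lines could convert it first. This is a rough sketch that only escapes non-ASCII characters as UTF-8 bytes; `escape_non_ascii` is a hypothetical helper name, and it assumes the rest of the URL is already valid.

```ruby
# Percent-encode every non-ASCII character as its UTF-8 bytes,
# leaving ASCII characters (including existing %XX escapes) untouched.
def escape_non_ascii(url)
  url.gsub(/[^\x00-\x7F]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

raw_url = 'https://www.zomato.com/grande-lisboa/fu-hao-massamá'
puts escape_non_ascii(raw_url)
```

The result can then be passed as the `url` in the snippet above.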