Question

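I have a huge list of web pages in a file (about 1.8 million). I basically want to query each of these pages to find the character encoding it uses. I could use wget, which would download the page, and then grep for the charset= pattern to get the encoding. But I don't want to download any of these pages, just query the encoding. How can I do this? Please suggest some other tool that is fast enough.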

Answer 1

You can do this easily with Python's requests library.

Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.head('http://www.google.com')
>>> r.encoding
'ISO-8859-1'

Note the use of the head method rather than get (the latter would download the entire page).
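For a list of 1.8 million URLs, issuing the HEAD requests one at a time would likely be slow, so here is a rough sketch of running them concurrently. It is only an illustration under some assumptions: it uses the standard library's urllib instead of requests so that it has no dependencies, and the names head_charset and charsets_for, the 10-second timeout, and the worker count of 32 are all made up for this example. It returns None when the server declares no charset or the request fails.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def head_charset(url, timeout=10):
    """Send a HEAD request; return the charset declared in Content-Type, or None."""
    try:
        req = Request(url, method="HEAD")  # HEAD: headers only, no body
        with urlopen(req, timeout=timeout) as resp:
            # Parses the charset parameter out of the Content-Type header.
            return resp.headers.get_content_charset()
    except Exception:
        # Malformed URL, unreachable host, timeout, HTTP error: treat as unknown.
        return None


def charsets_for(urls, fetch=head_charset, workers=32):
    """Map each URL to its declared charset, fetching concurrently."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(fetch, urls)))
```

With a file of URLs, one per line, you would probably want to read it in batches and call charsets_for on each batch rather than load all 1.8 million lines at once. Keep in mind that some servers do not answer HEAD requests properly, and a page may declare its encoding only in a meta tag inside the HTML, which a HEAD request cannot see.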

You can also use curl's -I flag to send a HEAD request and grep for the "Content-Type" line:

jjensen@jjensen-dev:~$ curl -I www.google.com
HTTP/1.1 200 OK
Date: Sun, 16 Feb 2014 09:05:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=081cb517341de334:FF=0:TM=1392541528:LM=1392541528:S=O2_rr0DFBFW5RtJS; expires=Tue, 16-Feb-2016 09:05:28 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=Ouu0WjP7K0cdtuLZ1XTRdETnNTIRbf1DjfopTXoFAdC84DnrQ03OsABMx7QUFlRJ3pPrvkmO8-2nUmVfjjpEMLg-CNlh7wBLmuf5xrbJN-qmPVp7zhfS39q9xrjIOk8B; expires=Mon, 18-Aug-2014 09:05:28 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
Transfer-Encoding: chunked
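If you capture curl -I output like the above, pulling the charset out of the Content-Type line can be sketched as follows. This is a minimal illustration, not the only way to do it: charset_from_headers is a hypothetical helper name, and it assumes a single block of response headers (with redirects followed there may be several blocks, and this would return the first Content-Type it finds).

```python
from email.message import Message


def charset_from_headers(raw_headers):
    """Return the charset from the Content-Type line of raw HTTP headers, or None."""
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-type":
            # Let the stdlib parse header parameters such as charset=...
            msg = Message()
            msg["Content-Type"] = value.strip()
            return msg.get_content_charset()  # lowercased, or None if absent
    return None


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=ISO-8859-1\r\n"
    "Server: gws\r\n"
)
print(charset_from_headers(sample))  # prints iso-8859-1
```

The result is lowercased by get_content_charset, which makes it easier to tally encodings across many pages.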

Query web pages without downloading them
