Question

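I am doing some data scraping on Wikipedia and want to read certain entries. I am using urllib.urlopen('http://www.example.com') and calling read() on the result.

This works fine until it encounters non-English characters, as in Stanislav Šesták. Here is the code, followed by the first few lines of output: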

import urllib

print urllib.urlopen("http://en.wikipedia.org/wiki/Stanislav_Šesták").read()

Result:

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>Stanislav ֵ estֳ¡k - Wikipedia, the free encyclopedia</title>
<meta name="generator" content="MediaWiki 1.23wmf8" />
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="edit" title="Edit this page" href="/w/index.php?title=Stanislav_%C5%A0est%C3%A1k&amp;action=edit" />
<link rel="apple-touch-icon" href="//bits.wikimedia.org/apple-touch/wikipedia.png" />

How can I retain the non-English characters? In the end, this code will write the entry title and the URL to a .txt file.

Answer 1

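There are multiple issues:

Non-ASCII characters in a string literal: in this case you must specify an encoding declaration at the top of the module.
You should urlencode the URL path (u"Stanislav_Šesták" -> "Stanislav_%C5%A0est%C3%A1k").
You are printing bytes received from the web to your terminal. Unless both use the same character encoding, you may see garbage instead of some characters.
To interpret HTML, you should use an HTML parser.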
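
The second and third points can be checked offline, without touching the network. The following is a minimal sketch, assuming Python 3 (where `urllib.quote` is available as `urllib.parse.quote`). The variable names are illustrative, and Latin-1 only stands in for whatever encoding a mismatched terminal happens to use:

```python
from urllib.parse import quote

wiki_title = "Stanislav_Šesták"

# Percent-encode the title for use in a URL path.
# quote() encodes str input as UTF-8 by default.
url_path = quote(wiki_title)
print(url_path)

# The same UTF-8 bytes, decoded with the wrong codec, turn into garbage.
raw = wiki_title.encode("utf-8")
garbled = raw.decode("latin-1")   # wrong codec: every byte becomes one character
restored = raw.decode("utf-8")    # right codec: the original text comes back

# ascii() keeps the output printable even on a terminal that is not UTF-8.
print(ascii(garbled))
print(restored == wiki_title)
```

The two-byte UTF-8 sequences for Š and á each show up as two unrelated characters when decoded with a single-byte codec, which is the same kind of damage visible in the title line of the question's output.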

Here is a code example that takes the above remarks into account:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import cgi
import urllib
import urllib2

wiki_title = u"Stanislav_Šesták"
url_path = urllib.quote(wiki_title.encode('utf-8'))
r = urllib2.urlopen("https://en.wikipedia.org/wiki/" + url_path)
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset')
content = r.read()
unicode_text = content.decode(encoding or 'utf-8')
print unicode_text # if it fails; set PYTHONIOENCODING
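
The example above targets Python 2. As a rough sketch of how the same steps might look on Python 3, including the final step from the question (writing the entry title and URL to a .txt file), consider the code below. The helper names `charset_from_content_type`, `fetch_text`, and `save_entry`, the tab-separated line format, and the file name are assumptions made for illustration; they are not part of the original answer.

```python
import io
from email.message import Message
from urllib.parse import quote
from urllib.request import urlopen


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def fetch_text(wiki_title):
    """Download an article and return (url, decoded text). Needs network access."""
    url = "https://en.wikipedia.org/wiki/" + quote(wiki_title)
    with urlopen(url) as r:
        encoding = charset_from_content_type(r.headers.get("Content-Type", ""))
        return url, r.read().decode(encoding)


def save_entry(path, title, url):
    """Append one 'title<TAB>url' line to a UTF-8 encoded text file."""
    with io.open(path, "a", encoding="utf-8") as f:
        f.write(title + "\t" + url + "\n")
```

A possible usage would be `url, text = fetch_text("Stanislav_Šesták")` followed by `save_entry("entries.txt", "Stanislav Šesták", url)`. Opening the file with an explicit `encoding="utf-8"` means the result does not depend on the terminal or locale settings, which was the source of the garbled output in the question.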
