Question

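I'm writing a program that reads a web page with urllib and then uses "html2text" to write the plain text to a file. However, the raw content returned by urllib.read() contains all sorts of characters, so it keeps raising UnicodeDecodeError.

Of course I Googled this for 3 hours and found plenty of answers, such as using HTMLParser or reload(sys), using external modules like pdfkit or BeautifulSoup, and of course .encode/.decode.

Reloading sys and then calling sys.setdefaultencoding("utf-8") gives me the result I want, but IDLE and the program become unresponsive afterwards.

I tried every variation of .encode/.decode with 'utf-8' and 'ascii', with arguments such as 'replace', 'ignore' and so on. For some reason, no matter which arguments I supply, it raises the same encoding/decoding error every time.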

def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    page = urllib.urlopen(url)
    content = page.read()
    with open(name, 'wb') as w:
        HP_inst = HTMLParser.HTMLParser()
        content = content.encode('ascii', 'xmlcharrefreplace')
        if True: 
            #w.write(HTT.html2text( (HP_inst.unescape( content ) ).encode('utf-8') ) )
            w.write( HTT.html2text( content) )#.decode('ascii', 'ignore')  ))
            w.close()
            print "Saved!"

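There must be another method or encoding... Please help!

Side Quest: I sometimes have to write to a file whose name includes unsupported characters, such as "G\u00e9za Teleki" + ".txt". How can I filter those characters out?

Notes:

This function lives in a class (hence the "self").
Using Python 2.7
I don't want to use BeautifulSoup
Windows 8 64-bit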

Answer 1

You should decode the content from urllib using the correct encoding, for example utf-8 or latin1, depending on the page you fetched.

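There are several ways to detect the content's encoding: from the HTTP headers, or from the meta tags in the HTML. I would use an encoding-detection module; I forget the name, but you can Google it.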
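As a rough illustration of the header and meta-tag route, here is a minimal sketch that uses only the standard library. The function names `charset_from_header` and `charset_from_meta` are made up for this example, and the regular expressions are a simplification, not a full HTML parser.

```python
import re

def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r'charset=["\']?([\w.:-]+)', content_type or "", re.I)
    return match.group(1) if match else None

def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag of raw HTML bytes, or None."""
    # Only the start of the document is scanned; the declaration sits in <head>.
    match = re.search(br'<meta[^>]+charset=["\']?([\w.:-]+)', raw_html[:2048], re.I)
    return match.group(1).decode("ascii") if match else None
```

The header value would come from the response object, and the meta-tag lookup runs on the undecoded bytes returned by read(). Either result can then be passed to `.decode()`.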

Once it is decoded correctly, you can encode it to whatever encoding you like before writing it to the file.
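The decode-then-encode step might look like the sketch below. `save_as_utf8` is a hypothetical helper, and it assumes you already know (or have detected) the page's encoding. UTF-8 as the output encoding is just one reasonable choice.

```python
import io

def save_as_utf8(raw, source_encoding, path):
    """Decode raw bytes from source_encoding and write them to path as UTF-8."""
    # bytes -> unicode text; undecodable bytes become U+FFFD instead of raising
    text = raw.decode(source_encoding, "replace")
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)  # io.open encodes the text to UTF-8 on write
    return text
```

The key point is that `.decode()` is called on the bytes first. Calling `.encode()` directly on a Python 2 byte string triggers an implicit ASCII decode, which is one common source of the UnicodeDecodeError described in the question.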

======================================

The encoding-detection module I had in mind is chardet.
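A minimal sketch of letting chardet guess the encoding follows. chardet is a third-party package (`pip install chardet`); the helper name `decode_with_guess` and the UTF-8 fallback are assumptions made for illustration. The guess is statistical, so it can be wrong, especially on short inputs.

```python
try:
    import chardet  # third-party; the guard keeps this sketch importable without it
except ImportError:
    chardet = None

def decode_with_guess(raw, fallback="utf-8"):
    """Decode bytes using chardet's guess, or `fallback` when there is no guess."""
    # chardet.detect returns a dict such as {'encoding': ..., 'confidence': ...}
    guess = chardet.detect(raw) if chardet is not None else {}
    encoding = guess.get("encoding") or fallback
    return raw.decode(encoding, "replace")
```

In the question's function this would be applied to the result of `page.read()` before the text is handed to html2text.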

Answer 2

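You have to know the encoding the remote web page uses. There are many ways to find that out, but the easiest is to use the Python-Requests library instead of urllib. Requests returns pre-decoded Unicode objects.

You can then use an encoding file wrapper to automatically encode every character you write.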

import io

import requests
import html2text as HTT  # assumption: the question's HTT alias is the html2text module

def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    req = requests.get(url)
    content = req.text # Returns a Unicode object decoded using the server's header
    with io.open(name, 'w', encoding="utf-8") as w: # Everything written to w is encoded to UTF-8
        w.write( HTT.html2text( content) )

    print "Saved"
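On the side question about file names such as u"G\u00e9za Teleki.txt", one possible approach is to reduce the name to ASCII and replace the characters Windows forbids. This is only a sketch: `safe_filename` is a hypothetical helper, and stripping accents loses information, which may or may not be acceptable.

```python
import re
import unicodedata

def safe_filename(name, replacement="_"):
    """Return an ASCII-only version of a unicode name without characters Windows forbids."""
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with "ignore" then drops the marks and other non-ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace characters that Windows does not allow in file names.
    return re.sub(r'[<>:"/\\|?*]', replacement, ascii_name)
```

On Python 2 the name has to be passed as a unicode object (for example u"G\u00e9za Teleki.txt"), not as a byte string.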

Python HTML to text file UnicodeDecodeError?
