Question

I am new to Python programming. I am using the following code in my Python file:

import gethtml
import articletext
url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
result = articletext.getArticle(url)
text_file = open("Output.txt", "w")

text_file.write(result)

text_file.close()

The file articletext.py contains the following code:

from bs4 import BeautifulSoup
import gethtml
def getArticleText(webtext):
    articletext = ""
    soup = BeautifulSoup(webtext)
    for tag in soup.findAll('p'):
        articletext += tag.contents[0]
    return articletext

def getArticle(url):
    htmltext = gethtml.getHtmlText(url)
    return getArticleText(htmltext)

But I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 473: ordinal not in range(128)
What is the proper code to write the result to the output file?

The output `result` is text in the form of a paragraph.

Answer 1

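To deal with the Unicode error, we need to encode the text as Unicode (UTF-8, to be precise) rather than ASCII. To make sure no error is raised if an encoding problem does occur, we ignore any characters we have no mapping for. (You can also use 'replace' or one of the other options that str.encode accepts. See the Python docs on Unicode.)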
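
As a rough illustration of the error-handler options mentioned above (written for Python 3, where every str is already Unicode; the sample string is made up for this sketch and is not the real article text):

```python
# Sample text containing the curly quotes (U+201C, U+201D) named in the error.
sample = "\u201cresult-oriented\u201d steps"

# UTF-8 can represent every character here, so nothing is dropped.
utf8_bytes = sample.encode("utf-8")

# ASCII cannot represent the curly quotes; the handler decides what happens.
ascii_ignored = sample.encode("ascii", "ignore")    # drops them silently
ascii_replaced = sample.encode("ascii", "replace")  # substitutes '?'

print(utf8_bytes)
print(ascii_ignored)
print(ascii_replaced)
```

With UTF-8 as the target, the 'ignore' handler should rarely have anything to do; it matters mostly when the target codec is narrower, such as ASCII.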

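The best practice for opening files is to use a Python context manager, which closes the file even if an error occurs. I use forward slashes rather than backslashes in the path so that it works on both Windows and Unix/Linux.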

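# Encode to UTF-8 bytes; 'ignore' drops anything that cannot be encoded.
text = text.encode('UTF-8', 'ignore')
# 'wb' because text is now bytes (required on Python 3, harmless on Python 2).
with open('/temp/Out.txt', 'wb') as file:
    file.write(text)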

This is equivalent to

text = text.encode('UTF-8', 'ignore')
# Open before the try block so close() only runs if open() succeeded.
file = open('/temp/Out.txt', 'wb')
try:
    file.write(text)
finally:
    file.close()

But the context manager is far less verbose, and far less likely to leave the file locked when an error occurs.
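
An alternative sketch, not taken from the answer itself: `io.open` accepts an `encoding` argument on both Python 2.6+ and Python 3, so the file object can do the encoding for you. The path and the sample text below are placeholders chosen for this example.

```python
import io
import os
import tempfile

# Placeholder text standing in for the scraped article.
text = u"\u201cquoted\u201d article text"

# Placeholder location in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "Out.txt")

# io.open encodes on write, so no manual .encode() call is needed.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back with the same encoding returns the original text.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

This keeps `text` as a Unicode string throughout, which avoids mixing bytes and text in the rest of the program.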

Answer 2

text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8')) 
text_filefixed.close()

This should work, give it a try.

Why? Because everything is saved as bytes encoded in UTF-8, the ASCII codec is never involved, so those encoding errors go away :D
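
A small sketch of that reasoning in Python 3 (the `result` string here is invented for the example; it is not the real scraped article):

```python
# Invented stand-in for the scraped text, with the same curly quote (U+201C).
result = "\u201cIndia calls for result-oriented steps\u201d"

# Strict ASCII encoding fails on the curly quotes, as in the question.
try:
    result.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 has a byte sequence for every character, so this succeeds.
encoded = bytes(result, "UTF-8")

print(ascii_ok)
print(encoded[:3])
```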

Edit: Make sure the file exists in the same folder; otherwise, put this code after the imports and it should create the file by itself.

text_filefixed = open("Output.txt", "a")
text_filefixed.close()

It creates the file, saves nothing to it, and closes it... but the file is created automatically, without any manual interaction.
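
To check that behaviour in isolation, here is a minimal sketch that uses a fresh temporary directory (an assumption made for the example, so it does not touch any real Output.txt):

```python
import os
import tempfile

# Fresh, empty directory so the file is guaranteed not to exist yet.
path = os.path.join(tempfile.mkdtemp(), "Output.txt")
existed_before = os.path.exists(path)

# Append mode creates the file if it is missing and never truncates it.
open(path, "a").close()

exists_after = os.path.exists(path)
size_after = os.path.getsize(path)
print(existed_before, exists_after, size_after)
```

Note that opening with "w" or "wb" also creates a missing file, so this step is likely optional.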

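EDIT 2: Note that this only works on 3.3.2, but I know you can use this module to achieve the same thing in 2.7. One minor difference is that (I think) 2.7 does not need request, but you should check.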

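from urllib import request

url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
# Decode the downloaded bytes; str() on bytes would produce a "b'...'" literal.
# Assumption: the page is served as UTF-8.
result = request.urlopen(url).read().decode('UTF-8')
text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8'))
text_filefixed.close()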

As far as I can tell, you will only run into this error on 2.7 (see: urllib.request in Python 2.7).
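
One possible way to paper over the 2.7/3.x difference is a guarded import. This is only a sketch: `fetch_text` is a hypothetical helper named for this example, and the default `encoding` is an assumption about the page being fetched.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch_text(url, encoding="utf-8"):
    """Hypothetical helper: download url and decode the bytes to text."""
    response = urlopen(url)
    try:
        # read() returns bytes; decode() turns them into a Unicode string.
        return response.read().decode(encoding)
    finally:
        response.close()
```

The returned value is Unicode text on both versions, so it can be encoded to UTF-8 bytes before being written to a file opened in "wb" mode.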

Output ASCII file from Unicode web scrape in Python
