Question

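I'm using Python 2.7 to parse a bunch of web pages and extract their content, but the pages contain characters such as “ and ”, which all somehow get converted to "ô". This leaves me with a file whose content looks like this (not including the quotes): "I think it, very important......"

The string looks fine when I print it in the terminal with print, but I can't get the same result with print >> file, string or file.write(string). This is obviously an encoding problem, but my searches haven't turned up a fix. I'm opening the file like this: file = codecs.open("file.txt","w+", encoding='utf-8') and I'm using BeautifulSoup4's getText() method to assign values to the strings. Is there any way to fix this?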

Answer 1

You could try writing it like this:

file.write(output_str.encode('utf-8', 'ignore'))
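One caveat worth sketching: a file opened through codecs.open with an encoding expects Unicode text and encodes it itself, so bytes you have already encoded are better written to a file opened in binary mode. The example below is a minimal illustration under that assumption; the sample string and the temporary path are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample text containing curly quotes and a curly apostrophe.
output_str = u"\u201cI think it\u2019s very important\u201d"

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Binary mode: the file stores the bytes produced by encode() unchanged.
with io.open(path, "wb") as f:
    f.write(output_str.encode("utf-8"))

# Read the raw bytes back and decode them to confirm the round trip.
with io.open(path, "rb") as f:
    round_tripped = f.read().decode("utf-8")
```

If round_tripped compares equal to output_str, the curly quotes survived the trip to disk.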

Answer 2

Force UTF-8 encoding at the start of your code by declaring the source encoding:

#!/usr/bin/python
# -*- coding: utf-8 -*-
myfile = open('./myfile.txt', 'w')
myfile.write("I think it's important to be able to see all characters")
myfile.write("\nIt woùld be Ñìçè to see foreign letters as well")
myfile.write("\n")
myfile.close()

Answer 3

Some source code would help.

BeautifulSoup usually does a good job of guessing the encoding of a given string:

>>> from bs4 import BeautifulSoup as bs4

>>> print bs4("\x80", "html.parser").text # Windows 1252
€

>>> print bs4("\xe2\x82\xac", "html.parser").text # UTF-8
€

Except when it can't:

>>> print bs4("\xa4", "html.parser").text # ISO-8859-15
¤

So you should pass BeautifulSoup text that has already been decoded to Unicode:

>>> print bs4("\xa4".decode("iso-8859-15"), "html.parser").text # ISO-8859-15
€

This means your input data needs to be decoded correctly. Open input files with io.open(filename, "r", encoding="utf-8") (or whatever the appropriate encoding is).
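As a rough sketch of reading an input file with an explicit encoding, the example below first writes a small ISO-8859-15 file and then reads it back as Unicode. The file name and the sample markup are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical input file: byte 0xA4 is the euro sign in ISO-8859-15.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with io.open(path, "wb") as f:
    f.write(b"<p>\xa4</p>")

# Name the encoding explicitly so io.open hands back decoded Unicode text.
with io.open(path, "r", encoding="iso-8859-15") as f:
    html = f.read()
```

The resulting html value is Unicode text and can be passed to BeautifulSoup without any further guessing.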

If you are pulling from a remote website, check the "Content-Type" header, or use Requests, which returns decoded Unicode in the .text attribute of the response object.
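If you handle the header yourself, one possible way to pull the charset out of a Content-Type value is the standard library's email.message.Message. This is only a sketch: the helper name charset_from_content_type, the UTF-8 fallback, and the sample header and body are all assumptions made for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper; falling back to UTF-8 is an assumption,
    not something the header itself guarantees.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not carry one.
    return msg.get_content_charset() or default


# Hypothetical response data: header value and raw body bytes.
header = "text/html; charset=ISO-8859-15"
raw_body = b"<p>\xa4</p>"

# Decode the body with the declared charset before parsing it.
text = raw_body.decode(charset_from_content_type(header))
```

Decoding with the declared charset up front means the parser receives Unicode and never has to guess.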

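When writing to a file, you can use the codecs module. The io module is the newer way to do it.

When you put all of this together, you will be writing correctly encoded data.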
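A minimal sketch of the writing side, assuming the text is already Unicode (for example, the result of getText()): open the output with io.open and an explicit encoding, then write the Unicode string directly. The sample text and the temporary path are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical text, already decoded to Unicode, with curly quotes.
text = u"\u201cquoted\u201d and it\u2019s fine"

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Text mode with an explicit encoding: io.open encodes on write.
with io.open(out_path, "w", encoding="utf-8") as out:
    out.write(text)

# Inspect the raw bytes that actually landed on disk.
with io.open(out_path, "rb") as f:
    raw = f.read()
```

Here the file object does the encoding, so the string is never encoded by hand before writing.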

Answer 4

Try adding the following lines at the start of your function; this will solve your problem.

        import sys
        reload(sys)
        sys.setdefaultencoding('utf8')
