Question

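I know this is a perennial problem when working with Python 2.x. I am currently using Python 2.7. The text I want to write to a tab-delimited text file is pulled from a SQL Server 2012 database table whose server collation is set to SQL_Latin1_General_CP1_CI_AS.

The exceptions I get tend to vary slightly, but they are basically: UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 57: ordinal not in range(128)

or UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 308: ordinal not in range(128)

This is what I would usually turn to, but it still results in an error: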

from kitchen.text.converters import getwriter
with open("output.txt", 'a') as myfile:
    #content processing done here
    #title is text pulled directly from database
    #just_text is content pulled from raw html inserted into beautiful soup
    #    and using its .get_text() to just retrieve the text content
    UTF8Writer = getwriter('utf8')
    myfile = UTF8Writer(myfile)
    myfile.write(text + '\t' + just_text)

I have also tried:

# also performed for just_text and still resulting in exceptions
title = title.encode('utf-8')

and

title = title.decode('latin-1')
title = title.encode('utf-8')

and

title = unicode(title, 'latin-1')

I also replaced the with open() line with:

with codecs.open("codingOutput.txt", mode='a', encoding='utf-8') as myfile:

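I'm not sure what I'm doing wrong, or what I've forgotten to do. I also swapped the encode for a decode in case I was encoding/decoding backwards. No success.

Any help would be greatly appreciated.

Update

I've added print repr(title) when I first retrieve title from the database, and print repr(just_text) when I call .get_text(). Not sure how much this helps, but...

For title I get: <type 'str'> For just_text I get: <type 'unicode'>

Errors

These are the errors I get from content pulled from the BeautifulSoup Summary() function.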

C:\Python27\lib\site-packages\bs4\dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
C:\Python27\lib\site-packages\bs4\dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
C:\Python27\lib\site-packages\bs4\dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:3] == b'\xef\xbb\xbf':
C:\Python27\lib\site-packages\bs4\dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\x00\x00\xfe\xff':
C:\Python27\lib\site-packages\bs4\dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\xff\xfe\x00\x00':

ValueError: Expected a bytes object, not a unicode object

Part of the traceback is:

File <myfile>, line 39, in <module>
  summary_soup = BeautifulSoup(page_summary)
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 193, in __init__
  self.builder.prepare_markup(markup, from_encoding)):
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 99, in prepare_markup
  for encoding in detector.encodings:
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 256, in encodings
  self.chardet_encoding = chardet_dammit(self.markup)
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 31, in chardet_dammit
  return chardet.detect(s)['encoding']
File "C:\Python27\lib\site-packages\chardet\__init__.py", line 25, in detect
  raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object

Answer 1

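Here is some advice. Everything has an encoding. Your problem is just a matter of finding out the encodings of the different parts, re-encoding them into a common format, and writing the result to a file.

I recommend choosing utf-8 as the output encoding.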

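with open('output', 'wb') as f:
    # title is a byte string from the database; decode it to unicode first
    unistr = title.decode("latin-1") + "\t" + just_text
    # encode once, at the point of writing
    f.write(unistr.encode("utf-8"))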

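Beautiful Soup's get_text returns Python's unicode type. decode("latin-1") should turn your database content into the unicode type, which is then joined with the tab before being written out as bytes encoded in utf-8.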
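As a rough sketch of the same idea, the snippet below decodes only the values that are still byte strings and encodes once when writing. The helper name to_unicode, the sample values, and the temporary output path are made up for illustration, and "latin-1" is this answer's guess at the database encoding rather than something confirmed. It is written to run on both Python 2.7 and Python 3.

```python
import os
import tempfile


def to_unicode(value, encoding="latin-1"):
    """Decode byte strings with `encoding`; pass Unicode text through unchanged."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value


# Hypothetical stand-ins: a byte-string title as it might come from the
# database, and Unicode text like the value returned by get_text().
title = b"Caf\xe9\xa0menu"
just_text = u"r\xe9sum\xe9"

# Both operands are Unicode here, so no implicit ASCII decode happens.
unistr = to_unicode(title) + u"\t" + to_unicode(just_text) + u"\n"

# Assumed output location, used only so the sketch is self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(out_path, "ab") as f:  # binary mode: we write the encoded bytes ourselves
    f.write(unistr.encode("utf-8"))
```

Checking the type before decoding matters because calling .decode() on a value that is already unicode makes Python 2 encode it to ASCII first, which is another common source of these errors.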

Answer 2

The problem is that you are mixing bytes and Unicode text:

>>> u'\xe9'.encode('utf-8') + '\t' + u'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

where u'\xe9'.encode('utf-8') is a bytestring that encodes the é character (U+00e9) using the utf-8 encoding, and u'x' is Unicode text that contains the x character (U+0078).

The solution is to use Unicode text:

>>> print u'\xe9' + '\t' + u'x'
é       x
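A small script version of the two sessions above, as an illustration only: it captures the failure from mixing types and then repeats the concatenation with Unicode on both sides. The variable names are invented for this sketch. On Python 2 the mix raises UnicodeDecodeError as shown above, while Python 3 refuses it outright with TypeError, so the sketch catches both.

```python
encoded = u"\xe9".encode("utf-8")  # byte string holding the two bytes 0xc3 0xa9
text = u"x"                        # Unicode text

try:
    # Mixing a byte string with Unicode text: Python 2 tries an implicit
    # ASCII decode of the bytes, Python 3 rejects the operation.
    mixed = encoded + b"\t" + text
    error_name = None
except (UnicodeDecodeError, TypeError) as exc:
    error_name = type(exc).__name__

# Decode the bytes explicitly, then concatenate Unicode with Unicode.
fixed = encoded.decode("utf-8") + u"\t" + text
```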

BeautifulSoup accepts Unicode input:

>>> import bs4
>>> bs4.BeautifulSoup(u'\xe9' + '\t' + u'x')
<html><body><p>é        x</p></body></html>
>>> bs4.__version__
'4.2.1'

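Avoid unnecessary conversions to and from Unicode. Decode the input data to Unicode once, use Unicode everywhere to represent text in your program, and encode the output to bytes at the very end (if necessary):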

with open('output.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))
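Applied to the tab-delimited file from the question, that pattern might look roughly like the sketch below. The function names decode_field and write_rows, the sample rows, and the temporary path are hypothetical, and "latin-1" is only an assumed input encoding. io.open is used because it behaves the same on Python 2.7 and Python 3 and does the encoding at the output boundary.

```python
import io
import os
import tempfile


def decode_field(raw, encoding="latin-1"):
    """Input boundary: turn a byte string into Unicode text exactly once."""
    return raw.decode(encoding)


def write_rows(path, rows):
    """Output boundary: io.open encodes Unicode text to UTF-8 on write."""
    with io.open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(u"\t".join(row) + u"\n")


# Hypothetical input: raw byte strings as they might arrive from the database.
raw_rows = [
    (b"Caf\xe9", b"caf\xe9\xa0au lait"),
    (b"Na\xefve", b"plain ascii"),
]

# Everything between the two boundaries is Unicode text.
rows = [tuple(decode_field(field) for field in row) for row in raw_rows]

# Assumed output location, used only so the sketch is self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_rows(out_path, rows)
```

With a text-mode file opened through io.open, passing a byte string to write() is rejected, which surfaces any value that skipped the decode step instead of letting it fail later with an ASCII error.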
