I am trying to parse an XML file with BeautifulSoup:
content = open(filename, encoding='utf-8').read()
return BeautifulSoup(content)
I checked the encoding of the source file, and chardetect tells me it should be ASCII:
➜ worker git:(develop) ✗ chardetect ../complete_data/sample.xml git:(develop|✚9…
../complete_data/sample.xml: ascii with confidence 1.0
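However, it still breaks my program with the exception shown in the traceback below.
How can I fix this, and how can I find out the correct encoding in the future? Python's exception message is not very helpful here.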
Traceback (most recent call last):
File "parser_factory.py", line 97, in <module>
test_shareholder_meetings()
File "parser_factory.py", line 81, in test_shareholder_meetings
_import_source_files(collection_name="shareholder_meetings", dataset_name="WSH_BoD_Shareholder")
File "parser_factory.py", line 78, in _import_source_files
parser(f, collection_name).import_data()
File "/workspace/balala-wsh/worker/parser_base.py", line 21, in __init__
self.soup = self.read_file_in_bs(filename)
File "/workspace/balala-wsh/worker/parser_base.py", line 30, in read_file_in_bs
content = open(filename, encoding='utf-8').read()
File "/Users/sample_user/.pyenv/versions/3.4.3/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 180145: invalid continuation byte
Answer 0 (score: 1)
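chardet does not examine the whole file. If the file contains even a single 0xE7 byte, it is definitely not ASCII, and, as the traceback shows, it is not valid UTF-8 either.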
Perhaps https://tripleee.github.io/8bit#e7 can help you work out what it actually is.
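To see what the byte really is, it can help to read the file in binary mode and look at the bytes around the offset reported in the traceback. The following is only a minimal sketch: the helper names `first_non_ascii`, `show_context`, and `inspect_file` are hypothetical, and the candidate encodings are guesses to compare side by side, not a claim about what the file actually uses.

```python
def first_non_ascii(data):
    """Return the offset of the first byte >= 0x80, or -1 if data is pure ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1


def show_context(data, offset, width=20,
                 candidates=("cp1252", "latin-1", "mac_roman")):
    """Decode the bytes around `offset` with each candidate encoding.

    The candidate list is an assumption; extend it with whatever
    encodings are plausible for the source of the file.
    """
    chunk = data[max(0, offset - width):offset + width]
    results = {}
    for encoding in candidates:
        # errors="replace" keeps the comparison going if a byte is unmapped
        results[encoding] = chunk.decode(encoding, errors="replace")
    return results


def inspect_file(path):
    """Print how the first non-ASCII byte of a file looks in each encoding."""
    with open(path, "rb") as handle:
        raw = handle.read()
    offset = first_non_ascii(raw)
    if offset == -1:
        print("file is pure ASCII")
        return
    print("first non-ASCII byte 0x%02x at offset %d" % (raw[offset], offset))
    for encoding, text in show_context(raw, offset).items():
        print("%-10s %r" % (encoding, text))
```

Calling `inspect_file("../complete_data/sample.xml")` prints one line per candidate encoding; the one whose surrounding text reads naturally is the likely match.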
Answer 1 (score: 0)
You could try decoding with 'cp1252' as a test.
I believe the text you are reading is not in a Unicode encoding.
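One way to try this is to decode the raw bytes yourself, falling back to cp1252 only when UTF-8 fails. This is a minimal sketch, assuming cp1252 is the right guess (the traceback does not prove that); `decode_with_fallback` and `read_text_with_fallback` are hypothetical helper names.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; return (text, encoding_that_worked)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one
            continue
    raise ValueError("none of the encodings %r could decode the data"
                     % (encodings,))


def read_text_with_fallback(filename, encodings=("utf-8", "cp1252")):
    """Read a file as bytes and decode it with the first encoding that fits."""
    with open(filename, "rb") as handle:
        return decode_with_fallback(handle.read(), encodings)
```

The text returned by `read_text_with_fallback(filename)` can then be passed to BeautifulSoup in place of `content`. Note that a successful decode only means the bytes are valid in that encoding, not that the result is the intended text, so it is worth checking the decoded characters by eye.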