How do I convert a UTF-8 string to Chinese?

Asked: 2016-03-25 10:49:33

Tags: python python-3.x utf-8

Here is my code (Python 3.5).

import os
import sys

log = os.path.join(sys.path[0], 'log')
f = open(log, 'r', encoding='utf-8')
s = f.read()
r = s.decode('utf-8')

At this point I get an error message:

AttributeError: 'str' object has no attribute 'decode'

The log file looks something like this:

\/div>\n\t<\/div>\n\t<\/div>\n  <!-- <div class=\"search_feedback\">\n  <p>\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5<a href=\"javascript:void(0);\" suda-data=\"key=tblog_search_v4.1&value=weibo_suggest\" node-type=\"suggest\">\u53d1\u8868\u610f\u89c1<\/a>\u6216\u60a8\u53ef\u4ee5\u5173\u6ce8\u840c\u5c0f\u641c<a href=\"http:\/\/weibo.com\/wbsearch\" suda-data=\"key=tblog_search_v4.1&value=weibo_xiaosou\" title=\"\u6b22\u8fce\u8c03\u620f\u6700\u840c\u5b98\u535a\u5c4c\u4e1d~~\">@\u5fae\u535a\u641c\u7d22<\/a>\u83b7\u53d6\u641c\u7d22\u6280\u5de7\u3002<\/p>\n <\/div> -->\n<\/div>"})</script>
<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_common_searchHistory","js":["apps\/search_v6\/js\/pl\/common\/searchHistory.js?version=20160324190000"],"css":["appstyle\/searchV45\/css_v6\/pl\/pl_history.css?version=20160324190000"],"html":""})</script>

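It is actually a mix of HTML and UTF-8 characters. When I use exec, the interpreter raises an error, I think because the text contains a lot of ' and " characters.

Is there another way to solve this?

3 answers:

Answer 0 (score: 2)

Read the file as bytes (binary mode), then use bytes.decode('unicode_escape'):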

>>> b'\\">\\n  <p>\\u6b22\\u8fce\\u63d0\\u4ea4'.decode('unicode_escape')
'">\n  <p>欢迎提交'

So you can do this:

import os
import sys

log = os.path.join(sys.path[0], 'log')
with open(log, 'rb') as f:  # binary mode, so read() returns bytes
    s = f.read()
    print(s.decode('unicode_escape'))
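
One caveat worth keeping in mind: the unicode_escape codec decodes non-ASCII bytes as Latin-1, so a log that mixes \uXXXX escapes with real UTF-8 text could come out garbled. A minimal sketch of a different technique (a regular expression, not the codec used above) that works on text already read as str; the helper name decode_u_escapes is hypothetical, and it does not handle surrogate pairs or doubled backslashes:

```python
import re

# Matches a literal backslash-u followed by exactly four hex digits.
U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    """Replace each \\uXXXX escape in text with the character it names."""
    return U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Escaped code points next to real Chinese text; other escapes are left alone.
sample = '<p>\\u6b22\\u8fce\\u63d0\\u4ea4<\\/p> 中文'
print(decode_u_escapes(sample))
```

Only the \uXXXX sequences are rewritten; sequences such as \/ and \n stay as they were in the input.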

Alternatively, if you have the complete Python repr of a string, say "\u8f6c\u53d1" (which is not the case for the string in the question), you can use ast.literal_eval():

>>> s = '"\\u8f6c\\u53d1"'
>>> print(s)
"\u8f6c\u53d1"
>>> import ast
>>> u = ast.literal_eval(s)
>>> print(u)
转发
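
The escapes in the question's log (\uXXXX together with \/ and \") look like JSON string escaping, so json.loads may be a closer fit than ast.literal_eval. This is a sketch under the assumption that you can isolate one complete, quoted JSON string from the log; the variable raw below is a made-up sample, not taken from the file:

```python
import json

# A JSON string literal, quotes included, as it might appear in the log.
raw = '"\\u8f6c\\u53d1<\\/a>\\n"'

# json.loads resolves \uXXXX, \/ and \n according to the JSON rules.
text = json.loads(raw)
print(repr(text))
```

Unlike unicode_escape, the JSON parser also turns \/ into a plain /.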

Answer 1 (score: 0)

You may find the following information useful.

In [25]: s='this sentence with some UTF-8 characters\u8f6c\u53d1'.encode('utf-8')

In [26]: s.decode('utf-8')
Out[26]: 'this sentence with some UTF-8 characters转发'

In [34]: type('this sentence with some UTF-8 characters\u8f6c\u53d1')
Out[34]: builtins.str

In [35]: type('this sentence with some UTF-8 characters\u8f6c\u53d1'.encode('utf-8'))
Out[35]: builtins.bytes

In [36]: type('this sentence with some UTF-8 characters\u8f6c\u53d1'.encode('utf-8').decode('utf-8'))
Out[36]: builtins.str

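I guess 'this sentence with some UTF-8 characters\u8f6c\u53d1' is a string containing Unicode code points (ASCII characters are the same in Unicode). I'm not sure whether Python stores each character as its code point number (for A, whatever the Unicode code point of A is, and so on).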
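
A small sketch that may help with that doubt: in Python 3 a str is a sequence of Unicode code points, and ord() and chr() convert between a character and its number. The sample string is made up for illustration:

```python
# In source code, '\u8f6c' is just another way to write the character itself.
s = 'A\u8f6c\u53d1'
print(s)

# ord() gives the code point of each character; chr() goes the other way.
code_points = [ord(ch) for ch in s]
print(code_points)
print(''.join(chr(n) for n in code_points) == s)

# Encoding to UTF-8 produces bytes: one byte for A, three for each Chinese character.
encoded = s.encode('utf-8')
print(len(s), len(encoded))
```

So the code point of A is 65, the same value it has in ASCII.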

Answer 2 (score: 0)

Put '#coding:utf8' at the top of your program.