Re-encoding a Python Unicode file

Date: 2016-04-24 20:47:26

Tags: python json unicode

There are plenty of questions on SO about Unicode decoding, but none of them fits my needs.

Say I need to process some json data, part of which looks like this:

  u'message': {
    u'body': u'\u0441\u043f\u0430\u0441\u0438\u0431\u043e \u0431\u043e\u043b\u044c\u0448\u043e\u0435',
    u'user_id': 374298,
    u'title': u' ... ',
    u'date': 1461496370,
    u'read_state': 0,
    u'id': 4325364,
    u'out': 1
  }

The only goal is to get rid of the Unicode escapes so that the "body" line contains human-readable text:

  'message': {
    'body': 'спасибо большое',
    'user_id': 374298,
    'title': ' ... ',
    'date': 1461496370,
    'read_state': 0,
    'id': 4325364,
    'out': 1
  }

Since there is a large amount of json data, I adapted some Python source code I found, as follows:

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

filePath = "/home/user/foo.txt"

outPath = filePath + ".new"

outText = ""

inF = open(filePath, "rb")
fContent = unicode(inF.read(), 'utf-8')
inF.close()

for everysymbol in fContent:
    if isinstance(everysymbol, unicode):
        outText += everysymbol.decode('cp1251')
    else: 
        pass

fContent = outText


outF = open(outPath, "wb")
outF.write(outText.decode("utf-8"))
outF.close()

Unfortunately, this code does not change anything in the target file. How can I actually decode my data into human-readable form?

2 answers:

Answer 0: (score: 4)

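Your data is not json, but already-parsed json data. Python 2 prints dictionaries and lists using repr(), which displays non-ASCII characters as escape codes. Print the string directly to see its contents, which uses str():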

import json

# Your already parsed data
data = {u'message': {
     u'body': u'\u0441\u043f\u0430\u0441\u0438\u0431\u043e \u0431\u043e\u043b\u044c\u0448\u043e\u0435',
     u'user_id': 374298,
     u'title': u' ... ',
     u'date': 1461496370,
     u'read_state': 0,
     u'id': 4325364,
     u'out': 1}}

# Write it to a file to generate real json-encoded data:
with open('foo.json','w') as f:
    json.dump(data,f,indent=2)

# Display actual json-encoded file:
with open('foo.json') as f:
    print f.read()
print

# reparse the data
with open('foo.json') as f:
    data = json.load(f)

# Printing dictionaries and lists escapes non-ASCII
print data
print

# But print the string directly and it is fine (assuming your terminal and font supports the characters):

print data[u'message'][u'body']

Output:

{
  "message": {
    "body": "\u0441\u043f\u0430\u0441\u0438\u0431\u043e \u0431\u043e\u043b\u044c\u0448\u043e\u0435", 
    "user_id": 374298, 
    "title": " ... ", 
    "date": 1461496370, 
    "read_state": 0, 
    "id": 4325364, 
    "out": 1
  }
}

{u'message': {u'body': u'\u0441\u043f\u0430\u0441\u0438\u0431\u043e \u0431\u043e\u043b\u044c\u0448\u043e\u0435', u'user_id': 374298, u'title': u' ... ', u'date': 1461496370, u'read_state': 0, u'id': 4325364, u'out': 1}}

спасибо большое

Note that Python 3 no longer escapes printable non-ASCII characters in repr().
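
A minimal sketch of that Python 3 behaviour, assuming a Python 3 interpreter. The variable names are only illustrative and are not part of the original answer. repr() leaves printable Cyrillic untouched, while the built-in ascii() gives the escaped form that Python 2 used to show:

```python
# Python 3 only. All names here are illustrative.
data = {'message': {'body': '\u0441\u043f\u0430\u0441\u0438\u0431\u043e \u0431\u043e\u043b\u044c\u0448\u043e\u0435'}}

# repr() of a container keeps printable non-ASCII characters as they are.
shown = repr(data)

# ascii() escapes every non-ASCII character, similar to Python 2's repr()
# (without the u prefix on strings).
escaped = ascii(data)

print(shown)
print(escaped)
```

So on Python 3, simply printing the parsed dictionary may already be readable, provided the terminal and font support the characters.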

Answer 1: (score: 0)

#!/usr/bin/env python3
import json

filePath = "/home/user/foo.txt"

outPath = filePath + ".new"

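# Parse the JSON file; json.loads accepts UTF-8 bytes on Python 3.6+
with open(filePath, "rb") as f:
    fContent = json.loads(f.read())

# ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
with open(outPath, "wb") as f:
    f.write(json.dumps(fContent, ensure_ascii=False).encode("utf-8"))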
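
The part of the snippet above that matters is ensure_ascii=False: by default json.dumps escapes every non-ASCII character as \uXXXX, and switching that off keeps the text readable. Here is a small sketch of the difference, assuming Python 3 and input that is real JSON (not the Python repr shown in the question). The sample string and the temporary output path are made up for illustration:

```python
import json
import os
import tempfile

# Hypothetical sample input: real JSON text containing \u escapes.
raw = '{"message": {"body": "\\u0441\\u043f\\u0430\\u0441\\u0438\\u0431\\u043e"}}'
parsed = json.loads(raw)

escaped = json.dumps(parsed)                       # default: ASCII-only output
readable = json.dumps(parsed, ensure_ascii=False)  # keeps Cyrillic as-is

# Opening in text mode with an explicit encoding avoids the manual .encode().
out_path = os.path.join(tempfile.mkdtemp(), "foo.json")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(readable)

# Read the file back to confirm what was written.
with open(out_path, encoding="utf-8") as f:
    written = f.read()

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs, so any JSON parser should read either file the same way.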