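How do I handle non-ASCII characters in a CSV when using json.loads in Python?

Time: 2017-06-27 03:29:48

Tags: python json unicode

I have looked at several answers, including this one, but none of them seems to answer my question.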

Here are some sample rows from the CSV:

_id category
ObjectId(56266da778d34fdc048b470b)  [{"group":"Home","id":"53cea0be763f4a6f4a8b459e","name":"Cleaning Services","name_singular":"Cleaning Service"}]
ObjectId(56266e0c78d34f22058b46de)  [{"group":"Local","id":"5637a1b178d34f20158b464f","name":"Balloon Dí©cor","name_singular":"Balloon Dí©cor"}]

Here is my code:

import csv
import sys

from sys import argv
import json


def ReadCSV(csvfile):
    with open('newCSVFile.csv', 'wb') as g:
        filewriter = csv.writer(g)  #, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

        with open(csvfile, 'rb') as f:
            reader = csv.reader(f)  # create reader object
            next(reader)  # skip first row

            for row in reader:  # go through all the rows
                listForExport = []  # list that will hold two items: id and list of categories

                # ID section
                vendorId = str(row[0])  # pull the raw vendor id out of the first column of the csv
                vendorId = vendorId[9:33]  # slice to remove the ObjectId label and parentheses
                listForExport.append(vendorId)  # add vendor ID as first item in list

                # categories section
                tempCatList = []  # temporary list of categories for second item in listForExport

                # this is line 41, where the error stems from
                categories = json.loads(row[1])  # creates a list of dicts with the categories from a given row

                for names in categories:  # loop through the categories, reading the key 'name'

                    print names['name']

So the code pulls out the first category, but then fails when it reaches the non-ASCII characters:

Cleaning Services
Traceback (most recent call last):
  File "csvtesting.py", line 57, in <module>
    ReadCSV(csvfile)
  File "csvtesting.py", line 41, in ReadCSV
    categories = json.loads(row[1]) #create's a dict with the categoreis from a given row
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-10: invalid continuation byte

How should I handle this? I am happy to drop any non-ASCII items.

2 Answers:

Answer 0: (score: 1)

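Since you open the input CSV file in rb mode, I assume you are using Python 2.x. The good news is that the csv part is not the problem, because the csv reader reads plain bytes without trying to interpret them. The json module, however, insists on decoding the text to unicode, and by default it uses utf8. Because your input file is not utf8-encoded, it chokes and raises a UnicodeDecodeError.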
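To see the failure in isolation, here is a minimal sketch in Python 3 (the question itself uses Python 2.7, so this is an illustration, not the asker's exact code path). The sample bytes and the helper name try_utf8 are hypothetical; the byte 0xE9 is "é" in Latin1 but is not a valid UTF-8 sequence when followed by a plain ASCII letter.

```python
# Hypothetical sample: one JSON cell whose "é" was written as the single Latin1 byte 0xE9.
raw = b'[{"name":"Balloon D\xe9cor"}]'


def try_utf8(data):
    """Return the decoded text, or a short failure message if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.reason holds the codec's explanation of what went wrong
        return "failed: " + exc.reason


result = try_utf8(raw)
print(result)
```

The same bytes decode without complaint once an encoding that covers 0xE9 is chosen, which is what the rest of this answer relies on.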

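Latin1 has a nice property: the unicode value of any byte is simply the value of that byte, so decoding can never fail. Whether the result makes sense then depends on whether the actual encoding really is Latin1...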
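A small Python 3 sketch of that property (an illustration only; the variable names are made up for this example). It decodes all 256 possible byte values as Latin1, checks that each character's code point equals the byte value, and then shows what happens when the real encoding was something else.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin1 never raises, whatever the bytes are.
text = all_bytes.decode("latin-1")

# Each decoded character has the same numeric value as the byte it came from.
same_values = all(ord(ch) == b for ch, b in zip(text, all_bytes))

# If the data was really UTF-8, Latin1 still "succeeds" but yields mojibake:
# the two UTF-8 bytes of U+00E9 become two separate characters.
mojibake = "\u00e9".encode("utf-8").decode("latin-1")

print(same_values, len(text), repr(mojibake))
```

So Latin1 guarantees that decoding succeeds, not that the text is right.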

So you could do:

categories = json.loads(row[1], encoding="Latin1")
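For readers on Python 3: the encoding keyword shown above is a Python 2 feature, and json.loads no longer accepts it as of Python 3.9. A rough equivalent, assuming the cell really is Latin1, is to decode the bytes yourself and pass the resulting text to json.loads. The sample bytes below are hypothetical.

```python
import json

# Hypothetical raw cell, with "é" stored as the single Latin1 byte 0xE9.
raw = b'[{"name":"Balloon D\xe9cor","name_singular":"Balloon D\xe9cor"}]'

# Decode explicitly, then parse the text.
categories = json.loads(raw.decode("latin-1"))

names = [category["name"] for category in categories]
print(names)
```

If the file is read in text mode in Python 3, the same effect can be had by passing encoding="latin-1" to open(), so that the csv reader already yields decoded strings.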

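Alternatively, if you want to ignore non-ASCII characters, you can first convert the byte string to unicode while ignoring errors, and only then load the JSON: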

categories = json.loads(row[1].decode(errors='ignore'))     # ignore all non-ASCII characters
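The same idea as a Python 3 sketch, with the ASCII codec named explicitly (in Python 3, bytes.decode defaults to UTF-8 rather than ASCII). The sample bytes are hypothetical.

```python
import json

# Hypothetical raw cell containing one non-ASCII byte (0xE9).
raw = b'[{"name":"Balloon D\xe9cor"}]'

# Drop every byte that is not ASCII, then parse what is left.
text = raw.decode("ascii", errors="ignore")
categories = json.loads(text)

names = [category["name"] for category in categories]
print(names)
```

Only bytes of value 0x80 and above are dropped, so the JSON punctuation (brackets, braces, quotes, commas) is never affected; the cost is that the affected names silently lose characters.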

Answer 1: (score: 0)

Most likely your CSV content contains some non-ASCII characters.

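import re

def remove_unicode(text):
    # Python 2: str is a byte string, unicode is text.
    if not text:
        return text

    if isinstance(text, str):
        # byte string: drop any bytes that are not ASCII
        text = str(text.decode('ascii', 'ignore'))
    else:
        # unicode: drop any characters that cannot be encoded as ASCII
        text = text.encode('ascii', 'ignore')

    # strip everything outside printable ASCII (this also removes control characters)
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')

    return remove_ctrl_chars_regex.sub('', text)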

...
vendorId = remove_unicode(row[0])
...
categories = json.loads(remove_unicode(row[1]))
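The function above depends on Python 2's str/unicode split. A possible Python 3 counterpart is sketched below; the name strip_non_ascii is made up for this example and is not part of the original answer. It accepts either bytes or str and always returns str for non-empty input.

```python
import re

# Anything outside printable ASCII (0x20 to 0x7E), compiled once at import time.
_NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7e]")


def strip_non_ascii(text):
    """Return text with everything outside printable ASCII removed."""
    if not text:
        return text

    if isinstance(text, bytes):
        # bytes: drop any non-ASCII bytes while decoding
        text = text.decode("ascii", "ignore")

    # str: remove non-ASCII characters and control characters alike
    return _NON_PRINTABLE_ASCII.sub("", text)


print(strip_non_ascii(b"Balloon D\xe9cor"))
print(strip_non_ascii("Balloon D\u00e9cor\n"))
```

Note that the pattern also removes tabs and newlines, which is harmless for the single-line JSON cells in the question but may matter for other data.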