Reading a CSV file and validating a column based on UTF-8 characters

Time: 2016-09-30 22:33:47

Tags: python python-3.x csv unicode utf-8

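I have to read a CSV file with the columns (PersonName, age, address), and I have to validate PersonName: "PersonName may only contain UTF-8 characters."

I am using Python 3.x, so I cannot use the decode method after opening the file.

Please tell me how to open and read the file so that any row whose PersonName contains non-UTF-8 data is skipped and I can move on to validating the next row.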

1 answer:

Answer 0 (score: 0)

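Assuming the rest of the file either doesn't need checking or is legal UTF-8 (which includes ASCII data), you can open the file with encoding='utf-8' and errors='replace'. This changes any bytes that are invalid in UTF-8 into the Unicode replacement character, \ufffd. Alternatively, to preserve the data, you can use 'surrogateescape' as the errors handler, which maps each undecodable byte to a lone surrogate code point (U+DC80 to U+DCFF) in a way that can be undone later. You can then check for these whenever you like: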

import csv

with open(csvname, encoding='utf-8', errors='replace', newline='') as f:
    for PersonName, age, address in csv.reader(f):
        if '\ufffd' in PersonName:
            continue  # PersonName contained invalid UTF-8 bytes, skip this row
        # PersonName was decoded without errors, so process the row here
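
To make the difference between the two error handlers concrete, here is a minimal sketch that decodes a single byte string both ways. The sample value (a Latin-1 encoded name) is made up for illustration and is not from the question.

```python
# Hypothetical sample: "José Lima" encoded as Latin-1, so the byte 0xE9
# is not valid UTF-8 in this position.
raw = b'Jos\xe9 Lima'

# errors='replace' substitutes U+FFFD; the original byte is lost.
replaced = raw.decode('utf-8', errors='replace')

# errors='surrogateescape' maps byte 0xE9 to the lone surrogate U+DCE9.
escaped = raw.decode('utf-8', errors='surrogateescape')

print(ascii(replaced))  # 'Jos\ufffd Lima'
print(ascii(escaped))   # 'Jos\udce9 Lima'

# Only the surrogateescape form can be turned back into the original bytes.
restored = escaped.encode('utf-8', errors='surrogateescape')
print(restored == raw)  # True
```

Note that with errors='replace' a name that legitimately contains U+FFFD in the file would also be rejected, which is one reason to prefer the surrogateescape approach when that matters.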

Or, with surrogateescape, you can ensure that any non-UTF-8 data in the other fields is restored on write (if that is acceptable for your output):

import csv

with open(incsvname, encoding='utf-8', errors='surrogateescape', newline='') as inf,\
     open(outcsvname, 'w', encoding='utf-8', errors='surrogateescape', newline='') as outf:
    csvout = csv.writer(outf)
    for PersonName, age, address in csv.reader(inf):
        try:
            # Check for surrogate escapes, and reject PersonNames containing them
            # Most efficient way to do so is a test encode; surrogates will fail
            # to encode with default error handler
            PersonName.encode('utf-8')
        except UnicodeEncodeError:
            continue  # Had non-UTF-8, skip this row

        # PersonName was decoded without surrogate escapes, so process the row here

        # You can recover the original file bytes in your code for a field with:
        #     fieldname.encode('utf-8', errors='surrogateescape')
        # Or if you're just passing data to a new file, write the same strings
        # back to a file opened with the same encoding/errors handling; the surrogates
        # will be restored to their original values:
        csvout.writerow([PersonName, age, address])
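
For trying the surrogateescape approach without real files, the following is a rough, self-contained sketch that runs the same check against in-memory data. The helper name valid_rows and the sample rows are hypothetical; it assumes PersonName is the first column, as in the question.

```python
import csv
import io

def valid_rows(binary_stream):
    """Yield rows whose first field (PersonName) decoded as clean UTF-8."""
    text = io.TextIOWrapper(binary_stream, encoding='utf-8',
                            errors='surrogateescape', newline='')
    for row in csv.reader(text):
        if not row:
            continue  # skip blank lines
        try:
            # A strict test encode fails if the name holds surrogate escapes
            row[0].encode('utf-8')
        except UnicodeEncodeError:
            continue  # PersonName had non-UTF-8 bytes, skip this row
        yield row

# Made-up sample data: row 2 has a bad name, row 3 has a bad address only.
sample = io.BytesIO(
    'Zo\u00eb,31,Main St\r\n'.encode('utf-8')
    + b'Jos\xe9,42,Oak Ave\r\n'
    + b'Ann,27,Caf\xe9 Rd\r\n'
)
rows = list(valid_rows(sample))
for row in rows:
    print([ascii(field) for field in row])
```

Only the second row is dropped. The third row is kept because only PersonName is validated, and its address still carries the escaped byte, which can be recovered with encode('utf-8', errors='surrogateescape').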