Encoding problems when reading CSV files with Python

Date: 2015-12-02 15:10:39

Tags: python csv encoding

I have hit a roadblock while trying to read CSV files with Python.

Update: if you just want to skip the offending characters or errors, you can open the file like this:

with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file:
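
To make the effect of errors="ignore" concrete, here is a minimal sketch. The helper name read_text_lenient and the temporary demo file are made up for illustration and are not part of the original post. It shows that bytes which are not valid in the chosen encoding are dropped silently, so data can be lost without any warning.

```python
import os
import tempfile

def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, silently dropping bytes that are invalid in `encoding`."""
    # newline="" keeps line endings untouched, as the csv module recommends.
    with open(path, "r", encoding=encoding, errors="ignore", newline="") as fh:
        return fh.read()

# Demo file: byte 0x85 is the cp1252 ellipsis, which is not valid UTF-8 by itself.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,text\n1,wait\x85 what\n")
    demo_path = tmp.name

lenient_text = read_text_lenient(demo_path)
os.remove(demo_path)
print(repr(lenient_text))
```

The invalid byte simply disappears from the result, which is convenient for a quick look at the data but probably not what you want if the text has to be preserved exactly.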

Here is what I have tried so far.

import csv
import os

# root_dir is the top-level folder that contains the CSV files
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r') as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                print(row)

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I then tried:

with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file:

Error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined>

Now, if I just print data_file, it reports the encoding as cp1252, but if I try

with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file:

the error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I also tried the recommended package.

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

One of the lines I am trying to parse is:

2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT @WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None

Any ideas or help would be appreciated.

1 Answer:

Answer 0: (score: 1)

I would use csvkit, which automatically detects the appropriate encoding and decodes accordingly, e.g.:

import csvkit
reader = csvkit.reader(data_file)

As discussed in chat, the solution was:

for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
                print(data)
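
A note on why this works: the traceback is a UnicodeEncodeError, not a UnicodeDecodeError, which suggests the failure happens when print encodes each row for the console rather than when the file is read. The sketch below is a variation on the snippet above and is not taken from the original answer. The helper name printable and the cp437 code page are assumptions chosen for illustration. It replaces only the characters the target encoding lacks, instead of stripping everything outside ASCII.

```python
import csv
import io

def printable(text, encoding="ascii"):
    """Return `text` with characters that `encoding` cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

# In-memory stand-in for a CSV file containing an accented letter and an ellipsis.
sample = io.StringIO('1,"caf\u00e9 \u2026"\n')

# cp437 is used here only as an example of a legacy console code page.
safe_rows = [[printable(cell, "cp437") for cell in row]
             for row in csv.reader(sample)]

# Pure-ASCII output, so this print is safe on any console encoding.
print([[cell.encode("ascii", "backslashreplace").decode("ascii") for cell in row]
       for row in safe_rows])
```

In a real script you would probably pass sys.stdout.encoding (falling back to "ascii" if it is None) instead of a hard-coded code page, so that characters the console can display are left as they are.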