Question

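I built a DataFrame in pandas (v21.1; Python 3, Windows; 220k rows) and wrote it out to CSV. Opened in Excel, the file looks fine (220k rows). Read back in with pandas, the file now has an extra 40k rows and all sorts of encoding errors.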

I tried many to_csv / read_csv encoding= combinations, including utf-8, utf-8-sig, cp1252, ascii and utf-16. When writing out:

encoding='cp1252' or 'ascii' - UnicodeEncodeError: 'charmap' codec can't encode character '\u1e28' in position 261: character maps to <undefined>
encoding='utf-8', 'utf-8-sig', 'utf-16', 'cp1252' - no Python error in the console, but the data still doesn't render correctly when I import it again.

When reading, I often get this warning: DtypeWarning: Columns (0,1,3,4,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27,28,29,30,31,32,37,38,39,40,41,42,43,46,47,48,49,50,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,91,92,93,94,95,96,97,98,99,100,101,102) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)

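I have tried specifying dtypes for the columns by saving the dtypes to a dict when calling to_csv and passing the same dict to read_csv, but that also raised errors because unexpected data types were found, e.g. ValueError: Integer column has NA values in column 33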

When I do the round trip as an Excel file instead, it seems to work fine. When I try with a Python 2.7 installation, the same problem occurs.

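I suspect the problem may come from a third-party CSV file that I import, which only loads when I use 'cp1252'. I tried re-saving this input file from Excel as utf-8, but that did not help either.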

Thanks for any suggestions!

Answer 1

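You are getting the DtypeWarning because pandas cannot infer a single data type for all of those columns. Setting the dtype parameter to str will silence the warning.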

Reference: https://stackoverflow.com/a/27232309/5182482
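As a minimal sketch of the dtype suggestion above: the column names and sample values here are made up for illustration and are not from the question's file. It reads a small in-memory CSV with dtype=str so that pandas keeps every column as text rather than guessing a type per chunk.

```python
import io

import pandas as pd

# Hypothetical sample data: the "code" column mixes digits, text and an empty field.
csv_text = "id,code\n1,007\n2,abc\n3,\n"

# dtype=str tells pandas to keep every column as text instead of inferring types,
# which is what avoids the mixed-types DtypeWarning on large files.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(df["id"].tolist())    # values stay as strings, not integers
print(df["code"].tolist())  # leading zeros survive; the empty field is still read as missing
```

Note that this only addresses the warning. Columns read this way need an explicit conversion afterwards (for example with pd.to_numeric) if numeric types are required, and missing values still have to be handled before casting to integers.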

"Read back in with pandas, the file now has an extra 40k rows and all sorts of encoding errors."

I can't tell you exactly what is causing this part of the problem.

pandas to_csv / read_csv encoding errors
