dtype seemingly ignored in read_csv during a long loop

Time: 2017-01-15 18:41:22

Tags: python pandas

I am running into a read_csv error when looping through a long list of files. Here is how I can reproduce it.

Given the following pseudocode:

import pandas as pd

# ec_files, colnames and dtypes are defined elsewhere in the script.
ll = []  # collects one single-row frame of column means per input file
for f in ec_files:
    fdf = pd.read_csv(f, dtype=dtypes, header=None,
                      parse_dates=[0, 1], index_col=1, names=colnames,
                      na_values=["NAN"], true_values=["t"],
                      false_values=["f"], low_memory=False)
    ll.append(pd.DataFrame(fdf.mean()).transpose())

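where ec_files is a long list of path names for the input files, colnames is a list of column names, and dtypes is a dictionary giving the dtype of each column. There is no problem when the files are read individually, but with the loop above the process stops at a random file with the following traceback: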

  

Traceback (most recent call last):
  File "junk.py", line 15, in <module>
    false_values=["f"], low_memory=False)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1505, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 849, in pandas.parser.TextReader.read (pandas/parser.c:9907)
  File "pandas/parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas/parser.c:11161)
  File "pandas/parser.pyx", line 1047, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12536)
  File "pandas/parser.pyx", line 1126, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:13783)
ValueError: invalid literal for float(): 06-06 04:02:24.2

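Something seems to break during parsing. Why does this happen when using the loop, but not when the file where the traceback occurred is read on its own?

Although some may find it hard to believe, the file on which the error occurs changes every time I run the script, yet each of those files is read individually without any problem. Below is the traceback from another failed run of the script: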

  

Traceback (most recent call last):
  File "junk.py", line 16, in <module>
    false_values=["f"], low_memory=False)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1505, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 849, in pandas.parser.TextReader.read (pandas/parser.c:9907)
  File "pandas/parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas/parser.c:11161)
  File "pandas/parser.pyx", line 1047, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12536)
  File "pandas/parser.pyx", line 1126, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:13783)
ValueError: invalid literal for float(): 1.5635078898.16
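Because the failing file differs from run to run, it may help to record every failure instead of stopping at the first one, and to re-read the failing file immediately inside the same process. The following is only a minimal sketch under that assumption; `read_with_retry` is a hypothetical helper name, not part of pandas, and it simply forwards the keyword arguments to `pd.read_csv`.

```python
import pandas as pd


def read_with_retry(paths, **read_csv_kwargs):
    """Read each path; on ValueError, note the error and re-read the file once.

    Returns (frames, failures), where each failure is a tuple
    (path, first_error_message, retry_succeeded).
    """
    frames, failures = [], []
    for path in paths:
        try:
            frames.append(pd.read_csv(path, **read_csv_kwargs))
        except ValueError as first_error:
            # Re-read the same file right away to see whether the error repeats.
            try:
                pd.read_csv(path, **read_csv_kwargs)
                retry_succeeded = True
            except ValueError:
                retry_succeeded = False
            failures.append((path, str(first_error), retry_succeeded))
    return frames, failures
```

With the loop from the question, this could be called as `read_with_retry(ec_files, dtype=dtypes, header=None, ...)`. A failure whose retry succeeds would point to something transient rather than to the contents of that file.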

Here are the first few lines of the file on which the last traceback above occurred:

2016-06-24 14:00:00,2016-06-24 14:00:00,-63.202653,67.693223,0.10,317.200,248.200,0.250,-0.770,-0.010,99.50,0.45,93.39,,,,1.12458829806343,1.56350788627135,1265.86,398.16,332.80,0.05078614,0.0061028,0.9117393,0.1835912,-0.4333494,-0.8065823,-0.649,1.14,-0.029,0.98,332.29,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2016-06-24 14:00:00,2016-06-24 14:00:00.1,-63.202653,67.693223,0.10,317.200,248.200,0.250,-0.770,-0.010,99.50,0.45,93.39,,,,1.12458829806343,1.56350788627135,1265.86,398.16,332.80,0.05210823,0.005970591,0.9118717,1.419696,-0.05049266,-1.156707,-0.638,1.139,-0.02,0.93,332.26,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2016-06-24 14:00:00,2016-06-24 14:00:00.2,-63.202653,67.693223,0.10,317.200,248.200,0.250,-0.770,-0.010,99.50,0.45,93.39,,,,1.12458829806343,1.56350788627135,1265.86,398.16,332.80,0.05038951,0.005441753,0.9117393,-0.2251475,0.1442362,0.6797946,-0.625,1.165,-0.017,0.95,332.27,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2016-06-24 14:00:00,2016-06-24 14:00:00.3,-63.202653,67.693223,0.10,317.200,248.200,0.250,-0.770,-0.010,99.50,0.45,93.39,,,,1.12458829806343,1.56350788627135,1265.86,398.16,332.80,0.05224044,0.0061028,0.9113424,-1.813954,-0.4432509,-0.2021224,-0.629,1.161,-0.041,0.97,332.28,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2016-06-24 14:00:00,2016-06-24 14:00:00.4,-63.202653,67.693223,0.10,317.200,248.200,0.250,-0.770,-0.010,99.50,0.45,93.39,,,,1.12458829806343,1.56350788627135,1265.86,398.16,332.80,0.05157939,0.00623501,0.9118717,0.1374433,-0.2023152,-1.166616,-0.595,1.166,-0.009,0.98,332.29,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
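Comparing the failing token with these rows, `1.5635078898.16` starts with the first 10 characters of the 18th field (`1.56350788627135`) and ends with the last 5 characters of the 20th field (`398.16`). The sketch below checks for that kind of splice programmatically. `find_splice` is a hypothetical helper written for illustration, and `sample_row` holds only the first 21 fields of the first row shown above.

```python
def find_splice(token, fields):
    """Return (i, j, cut) if token == fields[i][:cut] + a suffix of fields[j], i < j.

    Returns None when the token cannot be explained as such a splice.
    """
    for i, left in enumerate(fields):
        for cut in range(1, len(left)):
            prefix = left[:cut]
            if not token.startswith(prefix):
                break  # longer prefixes of this field cannot match either
            rest = token[cut:]
            for j in range(i + 1, len(fields)):
                if rest and fields[j].endswith(rest):
                    return i, j, cut
    return None


# First 21 fields of the first sample row (the trailing fields are omitted here).
sample_row = (
    "2016-06-24 14:00:00,2016-06-24 14:00:00,-63.202653,67.693223,0.10,"
    "317.200,248.200,0.250,-0.770,-0.010,99.50,0.45,93.39,,,,"
    "1.12458829806343,1.56350788627135,1265.86,398.16,332.80"
)

print(find_splice("1.5635078898.16", sample_row.split(",")))
```

For this row the call returns `(17, 19, 10)` in zero-based indices. If that pattern holds across failures, it might suggest that characters are being dropped before the values reach type conversion, rather than dtype being ignored, though two tracebacks are not enough to be sure.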

My pandas version:

In [20]: pd.__version__
Out[20]: 
'0.19.0+git14-ga40e185'

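Again, this file (and every other file on which these random errors occur) is read without problems on its own using the exact same read_csv command. I am afraid that troubleshooting this may require me to provide all the files.

Thanks for any feedback, Seb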
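One thing that could be ruled out without sharing the files is whether a file's bytes change while it is being read, for example if something else is still writing to it. This is purely an assumption about a possible cause, not something the tracebacks show. The helpers `file_digest` and `changed_during` below are hypothetical names used for illustration and rely only on the standard library.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_during(path, action):
    """Run action(path) and report whether the file's digest differs afterwards."""
    before = file_digest(path)
    result = action(path)
    return result, file_digest(path) != before
```

Here `action` would be a function that wraps the read_csv call from the loop. If the digest never changes and the errors still appear, the input files themselves are less likely to be the cause.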

0 answers:

No answers