Type error on first steps with Apache Parquet

Time: 2018-02-04 19:13:13

Tags: python pandas csv data-science parquet

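I ran into this error on my first attempt at using the Apache Parquet file format, and it confuses me. Shouldn't Parquet support all the data types that pandas does? What am I missing?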

import pandas
import pyarrow
import numpy

data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)

raises:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)

table.pxi in pyarrow.lib.Table.from_pandas()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
    354             arrays = list(executor.map(convert_column,
    355                                        columns_to_convert,
--> 356                                        convert_types))
    357 
    358     types = [x.type for x in arrays]

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    423                 raise CancelledError()
    424             elif self._state == FINISHED:
--> 425                 return self.__get_result()
    426 
    427             self._condition.wait(timeout)

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
    343 
    344     def convert_column(col, ty):
--> 345         return pa.array(col, from_pandas=True, type=ty)
    346 
    347     if nthreads == 1:

array.pxi in pyarrow.lib.array()

array.pxi in pyarrow.lib._ndarray_to_array()

error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer

data.dtypes is:

0      object
1      object
2      object
3      object
4      object
5     float64
6     float64
7      object
8      object
9      object
10     object
11     object
12     object
13    float64
14     object
15    float64
16     object
17    float64
...

2 answers:

Answer 0 (score: 2)

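In Apache Arrow, every value in a table column must have the same data type. pandas supports Python object columns, where the values can be of different types. So you will need to do some data cleaning before writing to the Parquet format.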
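As a minimal sketch of what such cleaning might look like: pick one type per column and convert explicitly before handing the frame to Arrow. The frame, the column name `code`, and its values are made up for illustration; they are not taken from the question's data.

```python
import pandas as pd

# Hypothetical frame: "code" mixes int and str values, so its dtype is object.
df = pd.DataFrame({"code": [1, 2, "n/a", 4], "name": ["a", "b", "c", "d"]})

# Option A: treat the column as numeric. Values that cannot be parsed
# become NaN instead of raising.
numeric = pd.to_numeric(df["code"], errors="coerce")

# Option B: treat the column as text. Every value is converted to str.
text = df["code"].astype(str)

print(numeric.dtype, int(numeric.isna().sum()))
print(text.tolist())
```

Which option is right depends on what the column is supposed to hold. Option A silently discards the unparseable values, so it is worth checking how many NaN values appear afterwards.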

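We already handle some basic cases in the Arrow-Python bindings (such as a column containing both bytes and unicode), but we do not hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.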

Answer 1 (score: 1)

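I had the same problem and spent some time working out how to find the offending column. Here is what I came up with to find the mixed-type column, although I know there must be a more efficient way.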

The last column printed before the exception occurs is the mixed-type column.

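# method1: try saving the parquet file with only one object column kept at a
# time, to isolate the mixed type column. `df` is the DataFrame being saved.
cat_cols = df.select_dtypes('object').columns
for col in cat_cols:
    others = [c for c in cat_cols if c != col]  # every other object column
    print(col)  # printed before the write, so the last name shown is the culprit
    df.drop(columns=others).reset_index(drop=True).to_parquet("c:/temp/df.pq")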

Another attempt: list each column together with the types found among its unique values.

# method2: list all columns and the types within
def col_types(col):
    types = set([type(x) for x in col.unique()])
    return types

df.select_dtypes("object").apply(col_types, axis=0)
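
A variant of method 2, as a sketch only: count how often each Python type occurs in every object column and report just the columns that hold more than one type. The helper name `mixed_type_columns` and the sample frame are hypothetical.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return {column: {type name: count}} for object columns with more than one Python type."""
    report = {}
    for name in df.select_dtypes("object").columns:
        # Map each value to the name of its Python type, then count occurrences.
        counts = df[name].map(lambda v: type(v).__name__).value_counts()
        if len(counts) > 1:
            report[name] = counts.to_dict()
    return report

# Hypothetical sample: only column "a" mixes int and str.
df = pd.DataFrame({"a": [1, "x", 3], "b": ["p", "q", "r"], "c": [1.0, 2.0, 3.0]})
print(mixed_type_columns(df))
```

Missing values in a string column are NaN, which is a float, so such a column is reported as a mix of str and float. The counts help tell a few stray values apart from a column that is genuinely split between types.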