Question

我有一个用例，我想通过to_parquet（ddf，'TestParquet'，append = True）将多个Dask数据帧存储到一个公共镶木存储中。

镶木地板文件的结构是通过向其写入的第一个数据帧设置的（没有append = True）。

每个数据框都有“分类索引”列。

类别从一开始就在所有数据框中都是已知的，并且没有一个数据框具有相同的类别。

数据帧在整个类别列表上进行了分区（因此每个分区在保存到拼花地板之前都具有空分区）。最终，一旦所有daframe保存到镶木地板上，所有类别/分区将包含数据。

问题：追加第二个数据框后，索引不能用于检索特定类别。

这是一个最小的可重现示例：

熊猫：0.24.2 快速拼花：0.4.1 快手：2.22.0

import pandas as pd
import dask.dataframe as dd
import fastparquet

data= {'Name': [ 'C', 'B','F', 'B', 'F', 'B'], 'ID':[1, 1, 2, 2, 3, 3], 'Value':[2,3,4,1,2,3]}
df = pd.DataFrame(data)
df['Name']=df['Name'].astype(pd.api.types.CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E', 'F'], ordered=True) )
ddf = dd.from_pandas(df, npartitions=2)
ddf=ddf.set_index('Name', sorted=False).repartition(divisions=['A', 'B', 'C', 'D', 'E', 'F', 'F'], force=True)
ddf_parquet=dd.read_parquet('./TestParquet')

data2= {'Name': ['D', 'E', 'A', 'A','D', 'A'], 'ID':[1, 1, 2, 3, 3, 4], 'Value':[1,2,3, 4,5,6]}
df2 = pd.DataFrame(data2)
df2['Name']=df2['Name'].astype(pd.api.types.CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E', 'F'], ordered=True) )
ddf2 = dd.from_pandas(df2, npartitions=2)
ddf2=ddf2.set_index('Name', sorted=False).repartition(divisions=['A', 'B', 'C', 'D', 'E', 'F', 'F'], force=True)
dd.to_parquet(ddf2, './TestParquet', engine='fastparquet', append=True, ignore_divisions=True)
ddf_parquet2=dd.read_parquet('./TestParquet')

将第一个数据帧保存到镶木地板后，我可以使用索引而不会出现问题：

ddf_parquet.loc['B'].head()

      ID  Value
Name
F      3      2
F      2      4

但是，在附加第二个数据帧之后，尝试选择除第一个分区的索引值（'B'）以外的任何内容都会导致错误：

ddf_parquet2.loc['A'].head()


Traceback (most recent call last):
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4805, in get_slice_bound
    return self._searchsorted_monotonic(label, side)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4756, in _searchsorted_monotonic
    return self.searchsorted(label, side=side)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/base.py", line 1501, in searchsorted
    return self._values.searchsorted(value, side=side, sorter=sorter)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 1370, in searchsorted
    raise ValueError("Categorical not ordered\nyou can use "
ValueError: Categorical not ordered
you can use .as_ordered() to change the Categorical to an ordered one

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/dask/dataframe/methods.py", line 42, in try_loc
    return loc(df, iindexer, cindexer)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/dask/dataframe/methods.py", line 28, in loc
    return df.loc[iindexer]
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexing.py", line 1500, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexing.py", line 1867, in _getitem_axis
    return self._get_slice_axis(key, axis=axis)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexing.py", line 1533, in _get_slice_axis
    slice_obj.step, kind=self.name)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4673, in slice_indexer
    kind=kind)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4872, in slice_locs
    start_slice = self.get_slice_bound(start, 'left', kind)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4808, in get_slice_bound
    raise err
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4802, in get_slice_bound
    slc = self._get_loc_only_exact_matches(label)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 4772, in _get_loc_only_exact_matches
    return self.get_loc(key)
  File "/cba/local/mx/Linux_x86_64/python/Python-3.6.2/lib/python3.6/site-packages/pandas/core/indexes/category.py", line 438, in get_loc
    raise KeyError(key)
KeyError: 'A'

我不知道什么，为什么在索引中找不到'A'键，以及为什么错误提示要对类别进行排序，因为在设置索引之前已对'名称'类别进行了排序。

一个观察结果：

，两个数据帧的划分均按预期设置为：

（'A'，'B'，'C'，'D'，'E'，'F'，'F'）

保存到Parquet并读回数据后，在保存第一个数据帧后将分区重新对齐到第一个数据帧的非空分区：

('B', 'C', 'F', 'F')

因此，我必须在第二个Daframe后面附加ignore_partitions = True，否则我将收到一条错误消息，指出划分是重叠的（这正是为什么我在保存最终分区/之前先对整个列表类别进行了分区/从第一个数据帧开始划分。

重置索引和重新索引实际上是不可行的，因为实际数据集非常庞大（保存到Parquet之前总共约200GB）。

设置每个数据框的划分以匹配其实际类别，最终会导致相同的错误。

任何帮助将不胜感激。

无法在Append = True的镶木地板文件中存储的Dask数据帧上使用索引

0 个答案: