
Question

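I have a large dataset with many columns in (compressed) JSON format. I am trying to convert it to Parquet for subsequent processing. Some columns have a nested structure. For now, I want to ignore this structure and simply write those columns out as (JSON) strings.

So for the columns I have already identified, I am doing: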

df[column] = df[column].astype(str)

However, I am not sure which columns are nested and which are not. When writing to Parquet, I see this message:

<stack trace redacted> 

  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>

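This suggests that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How do I find out?

When I print the .dtypes of the pandas DataFrame, I cannot tell strings from nested values, because both show up as object.

EDIT: by showing the struct details, the error does give a hint about the nested column, but debugging this way is time-consuming. It also only reports the first error, which is annoying if you have several nested columns.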

Answer 1

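Casting nested structures to strings

If I understand your question correctly, you want to serialize the nested Python objects (lists, dicts) in df to JSON strings and leave the other elements unchanged. It is best to write your own conversion method: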

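import json

def json_serializer(obj):
    # isinstance needs a type or a tuple of types (a list raises TypeError);
    # add any other types that you consider nested to the tuple
    if isinstance(obj, (list, dict)):
        return json.dumps(obj)
    return obj

# on pandas 2.1+, DataFrame.map is the preferred name for applymap
df = df.applymap(json_serializer)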

If the DataFrame is large, using astype(str) will be faster.

nested_cols = []
for c in df:
    # isinstance takes a tuple of types, not a list
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)

for c in nested_cols:
    df[c] = df[c].astype(str) # this converts every element in the column, regardless of its type

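Because the call to any(...) short-circuits, this approach has a performance advantage. It returns as soon as it hits the first nested object in a column and does not waste time checking the rest. If one of the "Dtype Introspection" methods suits your data, using that will be faster still.

Check the latest version of pyarrow

I am assuming that you only cast these nested structures to strings because they cause errors in pyarrow.parquet.write_table. You may not need to convert them at all, because the problem with handling nested columns in pyarrow has been reportedly solved recently (March 29, 2020, version 0.17.0). That support may still have problems, though, and is under active discussion.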
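The two ideas above (short-circuit detection, then conversion) can be combined so that only the nested columns are touched and the result is valid JSON. Note that astype(str) produces the Python repr of each object (single-quoted keys, for instance), not JSON. The following is a minimal sketch under assumptions: the helper names find_nested_columns and nested_to_json, the NESTED_TYPES tuple, and the sample data are all made up for illustration.

```python
import json

import pandas as pd

# Assumption: these are the types treated as "nested"; extend as needed.
NESTED_TYPES = (list, dict, tuple)


def find_nested_columns(df):
    """Return the names of columns holding at least one nested Python object."""
    return [
        col
        for col in df.columns
        # any() stops at the first nested value found in the column.
        if any(isinstance(value, NESTED_TYPES) for value in df[col])
    ]


def nested_to_json(df):
    """Return a copy in which nested values are replaced by JSON strings."""
    out = df.copy()
    for col in find_nested_columns(out):
        out[col] = out[col].map(
            lambda v: json.dumps(v) if isinstance(v, NESTED_TYPES) else v
        )
    return out


# Hypothetical sample data.
df = pd.DataFrame(
    {
        "geo": [{"type": "Point", "coordinates": [1.0, 2.0]}, None],
        "tags": [["a", "b"], []],
        "name": ["x", "y"],
        "n": [1, 2],
    }
)

converted = nested_to_json(df)
print(find_nested_columns(df))
print(converted.loc[0, "geo"])
```

Non-nested values inside a nested column (such as the None above) are left alone, which keeps missing values missing instead of turning them into the string "None".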

Answer 2

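I ran into a similar problem when working with PySpark and a streaming dataset, where some columns were nested and some were not.

Suppose your DataFrame looks like this: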

df = pd.DataFrame({'A' : [{1 : [1,5], 2 : [15,25], 3 : ['A','B']}],
                   'B' : [[[15,25,61],[44,22,87],['A','B',44]]],
                   'C' : [((15,25,87),(22,91))],
                   'D' : 15,
                   'E' : 'A'
                  })


print(df)

                                         A  \
0  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B                         C   D  E  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  ((15, 25, 87), (22, 91))  15  A

We can stack the DataFrame and use apply with type to get the type of each column's value in the first row, then pass the result to a dictionary.

df.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
out:
{'A': dict, 'B': list, 'C': tuple, 'D': int, 'E': str}

Building on this, we can write a function that returns a tuple of the nested and un-nested columns.

Function

def find_types(dataframe):

    col_dict = dataframe.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
    unnested_columns = [k for (k,v) in col_dict.items() if v not in (dict,set,list,tuple)]
    nested_columns = list(set(col_dict.keys()) - set(unnested_columns))
    return nested_columns,unnested_columns

In action.

nested,unested = find_types(df)

df[unested]

   D  E
0  15  A

print(df[nested])

                          C                                        A  \
0  ((15, 25, 87), (22, 91))  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]
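Since find_types only looks at head(1), a column whose first row is empty or scalar but whose later rows are nested would be classified as un-nested. A possible variant that scans every row (stopping at the first nested value per column) is sketched below. The name find_types_all_rows and the sample data are hypothetical and not part of the original answer.

```python
import pandas as pd

# Same set of "nested" types that find_types checks for.
NESTED_TYPES = (dict, set, list, tuple)


def find_types_all_rows(dataframe):
    """Split column names into (nested, unnested), looking at every row."""
    nested, unnested = [], []
    for col in dataframe.columns:
        # any() short-circuits on the first nested value in the column.
        is_nested = any(isinstance(v, NESTED_TYPES) for v in dataframe[col])
        (nested if is_nested else unnested).append(col)
    return nested, unnested


# Hypothetical data: column A is only nested from the second row onward.
df = pd.DataFrame(
    {
        "A": [None, {"k": [1, 2]}],
        "B": [[1], [2, 3]],
        "D": [15, 16],
        "E": ["a", "b"],
    }
)

nested, unnested = find_types_all_rows(df)
print(nested)
print(unnested)
```

Building two lists directly, rather than taking a set difference, also keeps the columns in their original order.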

Answer 3

Using a general utility function in pandas such as infer_dtype(), you can determine whether a column is nested.

from pandas.api.types import infer_dtype

for col in df.columns:
    if infer_dtype(df[col]) == 'mixed':
        # 'mixed' is the catchall for anything that is not otherwise specialized
        df[col] = df[col].astype('str')

If you want to target specific data types, see Dtype Introspection.
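Because 'mixed' is a catchall, it may also be reported for columns that are heterogeneous without being nested. One way to hedge against that is to use infer_dtype as a cheap first filter and then confirm with an isinstance check. This is only a sketch: the column names and sample data are invented, and the tuple of nested types is an assumption.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "nested": [{"a": 1}, {"b": 2}],
        "text": ["x", "y"],
        "number": [1, 2],
        "odd": [b"raw", "text"],  # heterogeneous scalars, not nested
    }
)

# Inferred kind per column, useful as a quick report.
inferred = {col: infer_dtype(df[col]) for col in df.columns}
print(inferred)

# First pass: anything pandas could not classify more specifically.
candidates = [col for col, kind in inferred.items() if kind == "mixed"]

# Second pass: keep only columns that really contain nested objects.
confirmed = [
    col
    for col in candidates
    if any(isinstance(v, (list, dict, tuple, set)) for v in df[col])
]
print(confirmed)
```

The second pass only runs on the candidate columns, so the extra cost stays small when most columns have a specific inferred type.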

Answer 4

If you just want to find out which column is the culprit, simply write a loop that writes one column at a time and records which columns fail...

bad_cols = []
for i in range(df.shape[1]):
    try:
        df.iloc[:, [i]].to_parquet(...)
    except KeyboardInterrupt:
        raise
    except Exception:  # you may want to catch ArrowInvalid exceptions instead
        bad_cols.append(i)
print(bad_cols)
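The loop above records column positions. A small variation that records column names together with the exception type, so that every failing column shows up in a single pass, might look like the sketch below. The helper find_failing_columns and the stand-in fake_writer are hypothetical. With real data you would pass a callable that writes the one-column frame with to_parquet, using whatever arguments you normally use.

```python
import pandas as pd


def find_failing_columns(df, write_one):
    """Return (column name, exception type name) for each column that fails.

    write_one is any callable that takes a one-column DataFrame and raises
    if that column cannot be written.
    """
    bad_cols = []
    for i, name in enumerate(df.columns):
        try:
            write_one(df.iloc[:, [i]])
        except Exception as exc:  # narrow this (e.g. to ArrowInvalid) if you can
            bad_cols.append((name, type(exc).__name__))
    return bad_cols


def fake_writer(part):
    """Stand-in for a real writer: rejects dict values, for illustration only."""
    if any(isinstance(v, dict) for v in part.iloc[:, 0]):
        raise ValueError("nested value")


# Hypothetical sample data.
df = pd.DataFrame(
    {"ok": [1, 2], "geo": [{"a": 1}, {"b": 2}], "name": ["x", "y"]}
)
print(find_failing_columns(df, fake_writer))
```

Catching Exception does not swallow KeyboardInterrupt, so a long run over many columns can still be interrupted from the keyboard.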

Finding nested columns in a pandas DataFrame
