Disclaimer: I am very new to both topics (Python and Parquet), so please bear with me if my approach is convoluted.
I am looking for guidance on how best to perform the following conversion as efficiently as possible:
I have a flat Parquet file in which a varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested Parquet. If it helps, I know the schema of the JSON in advance.
Here is what I have "achieved" so far:
Build sample data
# load packages
import pandas as pd
import json
import pyarrow as pa
import pyarrow.parquet as pq
# Create dummy data with JSON stored as a string
# (double-quoted JSON so that json.loads can parse it later)
person_data = {'Name': ['Bob'],
               'Age': [25],
               'languages': ['{"mother_language": "English", "other_languages": ["German", "French"]}']
               }
# from dict to pandas DataFrame
person_df = pd.DataFrame.from_dict(person_data)
# from pandas DataFrame to PyArrow table
person_pat = pa.Table.from_pandas(person_df)
# save as parquet file
pq.write_table(person_pat, 'output/example.parquet')
Script proposal
# load dummy data
sample = pq.read_table('output/example.parquet')
# transform to dict
sample_dict = sample.to_pydict()
# print with indent for checking
print(json.dumps(sample_dict, sort_keys=True, indent=4))
# parse the JSON string in each row and replace it with a nested dict
# (to_pydict returns a list per column, so parse element-wise)
sample_dict['languages'] = [json.loads(s) for s in sample_dict['languages']]
print(json.dumps(sample_dict, sort_keys=True, indent=4))
# type(sample_dict['languages'][0])  # -> dict
# how to keep the nested structure when going from dict -> pandas DataFrame -> PyArrow table?
# save dict as nested parquet...
So, my specific question is: how do I keep the nested structure when going from a dict to a pandas DataFrame to a PyArrow table, and how do I then save it as nested Parquet?
Many thanks,
Stephan
Answer 0 (score: 4)
PySpark can achieve this with the simple approach shown below. The main benefit of using PySpark is that the infrastructure scales as the data grows, whereas with pure Python this can become a problem: unless you use a framework such as Dask, you will need a bigger machine to run it.
from pyspark.sql import HiveContext
# sc is the SparkContext provided by the pyspark shell
hc = HiveContext(sc)
# This is a way to create a PySpark dataframe from your sample, but there are others
nested_df = hc.read.json(sc.parallelize(["""
{'Name': ['Bob'],
'Age': [25],
'languages': "{'mother_language': 'English', 'other_languages': ['German', 'French']}"
}
"""]))
# You have a nested Spark dataframe here. show() prints its content:
# 20 is the maximum number of rows to display, and False means don't
# truncate columns that don't fit on the screen (show full column content)
nested_df.show(20, False)
# Writes to a location as parquet
nested_df.write.parquet('/path/parquet')
# Reads the file back from the previous location
# (spark is the SparkSession provided by the pyspark shell)
spark.read.parquet('/path/parquet').show(20, False)
The output of this code is:
+----+-----+-----------------------------------------------------------------------+
|Age |Name |languages |
+----+-----+-----------------------------------------------------------------------+
|[25]|[Bob]|{'mother_language': 'English', 'other_languages': ['German', 'French']}|
+----+-----+-----------------------------------------------------------------------+
+----+-----+-----------------------------------------------------------------------+
|Age |Name |languages |
+----+-----+-----------------------------------------------------------------------+
|[25]|[Bob]|{'mother_language': 'English', 'other_languages': ['German', 'French']}|
+----+-----+-----------------------------------------------------------------------+