I'm using the PySpark code below as part of an ETL pipeline. It currently runs on AWS Glue, but I'm trying to rewrite it as pure PySpark. The flow is: read the data as a Spark DataFrame, convert it to a pandas DataFrame, then parse a YAML column into new columns. When I import the table from the Data Catalog via Glue, the YAML parses correctly. When I read the same data directly from the Parquet files, parsing creates extra columns. Does anyone know what could cause this, and how to fix it? Sample code and output are below: first the AWS Glue version, then the pure PySpark version.
# from Glue
import yaml
import pandas as pd

# Read the table from the Glue Data Catalog and convert to a Spark DataFrame
slots_df = glueContext.create_dynamic_frame.from_catalog(
    database='slotting',
    table_name='slot_production_slots').toDF()
pd_slots_df = slots_df.toPandas()

# Parse the YAML stored in the query_builder column and flatten it
parsed_record = yaml.load(pd_slots_df['query_builder'][1], Loader=yaml.BaseLoader)
pd.json_normalize(parsed_record)
Output:
               field          name operator    type         value
0          priceunit    Price Unit  between  number      [0, 100]
1           abc_perc         abc %  between  number          [, ]
2        producttype  Product Type       in  string     [EXTRACT]
3     productsubtype      Sub Type       in  string  [LIVE SHOW ]
4       sp_brandname         Brand       in  string   [RAW FOOD ]
5        productname          Name       in  string            []
6               size          Size       in  string            []
7  productattributes    Attributes     like  string            []
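For reference, based on the normalized output above, each query_builder record presumably contains YAML along these lines (this is my reconstruction from row 0, not the actual stored string):

# Illustrative reconstruction of one query_builder record, inferred from
# row 0 of the output above
sample_yaml = """
- field: priceunit
  name: Price Unit
  operator: between
  type: number
  value: [0, 100]
"""
yaml.load(sample_yaml, Loader=yaml.BaseLoader)
# -> [{'field': 'priceunit', 'name': 'Price Unit', 'operator': 'between',
#      'type': 'number', 'value': ['0', '100']}]
# (BaseLoader leaves all scalars as strings)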
# from PySpark
# Read the same data directly from the Parquet files instead of the catalog
slots_df = sqlContext.read.parquet(base_path + 'slotting/slot_production.slots/')
pd_slots_df = slots_df.toPandas()

parsed_record = yaml.load(pd_slots_df['query_builder'][1], Loader=yaml.BaseLoader)
tst_df = pd.json_normalize(parsed_record).copy()
tst_df
Output:
           name           field    type  ...   :type  :operator  :value
0    Price Unit       priceunit  number  ...     NaN        NaN     NaN
1           NaN             NaN     NaN  ...  number    between    [, ]
2  Product Type     producttype  string  ...     NaN        NaN     NaN
3      Sub Type  productsubtype  string  ...     NaN        NaN     NaN
4         Brand    sp_brandname  string  ...     NaN        NaN     NaN
5          Name     productname  string  ...     NaN        NaN     NaN
6          Size            size  string  ...     NaN        NaN     NaN
7           NaN             NaN     NaN  ...  string       like      []

[8 rows x 10 columns]
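In case it helps with diagnosing: the extra :type / :operator / :value columns (note the leading colons) make me suspect the raw string coming back from the Parquet read is not identical to the one coming back from the Glue read. A rough way to check would be something like the sketch below (glue_df and parquet_df are illustrative names for the two reads, assumed to be available in the same session):

# Compare the raw query_builder strings from both sources
glue_str = glue_df.toPandas()['query_builder'][1]
parquet_str = parquet_df.toPandas()['query_builder'][1]

print(glue_str == parquet_str)    # False would confirm the strings differ
print(repr(glue_str)[:300])       # repr() makes hidden characters visible
print(repr(parquet_str)[:300])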