Parsing YAML from a pandas DataFrame that was originally a Spark DataFrame

Time: 2020-11-09 19:24:13

Tags: pandas pyspark apache-spark-sql yaml pyspark-dataframes

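I have the PySpark code below as part of an ETL pipeline. I currently run it with AWS Glue, but I am trying to rewrite it as pure PySpark. The data is first imported as a Spark DataFrame, then converted to a pandas DataFrame, and then the YAML in the query_builder column is parsed into new columns. When I import the data from the Data Catalog with Glue, the YAML parses correctly. When I import it directly from the Parquet files, extra columns are created. Do you know what could cause this, and how to fix it? Sample code and output are below, first for the data read through AWS Glue, then for the same steps in pure PySpark.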

# from glue

import pandas as pd
import yaml

slots_df = glueContext.create_dynamic_frame.from_catalog(
    database='slotting',
    table_name='slot_production_slots').toDF()

pd_slots_df = slots_df.toPandas()

# parse the YAML string at index 1 of query_builder; BaseLoader keeps every scalar as a string
parsed_record = yaml.load(pd_slots_df['query_builder'][1], Loader=yaml.BaseLoader)

pd.io.json.json_normalize(parsed_record)

Output:

               field          name operator    type         value
0          priceunit    Price Unit  between  number      [0, 100]
1           abc_perc         abc %  between  number          [, ]
2        producttype  Product Type       in  string     [EXTRACT]
3     productsubtype      Sub Type       in  string  [LIVE SHOW ]
4       sp_brandname         Brand       in  string  [RAW FOOD  ]
5        productname          Name       in  string            []
6               size          Size       in  string            []
7  productattributes    Attributes     like  string            []




# from pyspark

slots_df = sqlContext.read.parquet(base_path + 'slotting/slot_production.slots/')

pd_slots_df = slots_df.toPandas()

parsed_record = yaml.load(pd_slots_df['query_builder'][1], Loader=yaml.BaseLoader)

tst_df = pd.io.json.json_normalize(parsed_record).copy()

tst_df

Output:

           name           field    type  ...   :type :operator :value
0    Price Unit       priceunit  number  ...     NaN       NaN    NaN
1           NaN             NaN     NaN  ...  number   between   [, ]
2  Product Type     producttype  string  ...     NaN       NaN    NaN
3      Sub Type  productsubtype  string  ...     NaN       NaN    NaN
4         Brand    sp_brandname  string  ...     NaN       NaN    NaN
5          Name     productname  string  ...     NaN       NaN    NaN
6          Size            size  string  ...     NaN       NaN    NaN
7           NaN             NaN     NaN  ...  string      like     []

[8 rows x 10 columns]
<string>:5: FutureWarning: pandas.io.json.json_normalize is deprecated, use pandas.json_normalize instead
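
The `:type`, `:operator` and `:value` columns in the output above suggest that some entries in the parsed YAML carry keys with a leading colon, so `json_normalize` treats `:type` and `type` as different columns. Why the Glue read does not show them cannot be determined from the post alone; note also that `[1]` is a positional lookup, and Spark does not guarantee row order without an explicit sort. As a possible workaround, assuming the colon-prefixed keys are the only difference, the keys could be normalized before flattening. The helper name `strip_colon_keys` and the sample YAML below are hypothetical and stand in for the real `query_builder` value.

```python
import pandas as pd
import yaml


def strip_colon_keys(node):
    """Recursively drop one leading ':' from every string key in nested dicts/lists."""
    if isinstance(node, dict):
        return {
            (key[1:] if isinstance(key, str) and key.startswith(":") else key):
                strip_colon_keys(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_colon_keys(item) for item in node]
    return node


# Hypothetical sample that mixes plain and colon-prefixed keys;
# it is NOT the actual content of the query_builder column.
sample_yaml = """
- name: Price Unit
  field: priceunit
  type: number
  operator: between
  value: ['0', '100']
- :name: abc %
  :field: abc_perc
  :type: number
  :operator: between
  :value: ['', '']
"""

parsed_record = yaml.load(sample_yaml, Loader=yaml.BaseLoader)

# Normalize the keys first, then flatten into one set of columns.
normalized_df = pd.json_normalize(strip_colon_keys(parsed_record))
print(normalized_df)
```

In the question's code this would mean calling `pd.json_normalize(strip_colon_keys(parsed_record))` on the record parsed from the Parquet read. Comparing `repr(pd_slots_df['query_builder'][1])` from both sources would show whether the two reads return the same string in the first place.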

0 answers:

No answers