AWS Glue PySpark: splitting a dictionary stored as a string into multiple rows

Date: 2019-04-05 19:40:38

Tags: python apache-spark pyspark apache-spark-sql aws-glue

I am working with a large dataset whose records have the following format:

uniqueId col1 col2 col3  Levels
1    A1   A2   A3    {"2019-01-01":1.1 ,"2019-01-02":2.1 ,"2019-01-03":3.1}
2    B1   B2   B3    {"2019-01-01":1.2 ,"2019-01-03":3.2}
3    C1   C2   C3    {"2019-01-04":4.3}

The Levels column is stored as a string.

I am trying to split Levels into multiple rows so that the output looks like this:

uniqueId col1 col2 col3 date        value
1        A1   A2   A3   2019-01-01  1.1
1        A1   A2   A3   2019-01-02  2.1
1        A1   A2   A3   2019-01-03  3.1
2        B1   B2   B3   2019-01-01  1.2
2        B1   B2   B3   2019-01-03  3.2
3        C1   C2   C3   2019-01-04  4.3
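
For reference, here is a minimal snippet that reproduces this sample locally (the column names are assumed to match the Glue table, which lower-cases them to unique_id, levels, etc.):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sample = spark.createDataFrame(
    [
        (1, "A1", "A2", "A3", '{"2019-01-01":1.1 ,"2019-01-02":2.1 ,"2019-01-03":3.1}'),
        (2, "B1", "B2", "B3", '{"2019-01-01":1.2 ,"2019-01-03":3.2}'),
        (3, "C1", "C2", "C3", '{"2019-01-04":4.3}'),
    ],
    ["unique_id", "col1", "col2", "col3", "levels"],
)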

To do the split I am running a script on AWS Glue (PySpark), following the solution proposed here:

PySpark "explode" dict in column

import json
from pyspark.sql.functions import explode, udf

# Parse the JSON string into a map column; malformed input becomes null.
@udf("map<string, string>")
def parse(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        return None
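
As a quick sanity check, applying the UDF to the sample DataFrame sketched above (the sample name is hypothetical) should produce one (date, value) row per map entry:

sample.select("unique_id", explode(parse("levels")).alias("date", "value")).show(truncate=False)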



# Read the source table from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "database", table_name = "table", transformation_ctx = "datasource0")

# Convert to a Spark DataFrame so the UDF and explode() can be applied
sparkDF = datasource0.toDF()

# explode() turns each key/value pair of the parsed map into its own row
sparkDF2 = sparkDF.select("unique_id", "col1", "col2", "col3", explode(parse("levels")).alias("date", "value"))



from awsglue.dynamicframe import DynamicFrame

GlueDF_tmp = DynamicFrame.fromDF(sparkDF2, glueContext, 'GlueDF_tmp')

# The exploded map yields string keys and values, so the source types below
# are "string"; apply_mapping casts them to timestamp / double on the way out.
GlueDF = GlueDF_tmp.apply_mapping([("unique_id", "string", "unique_id", "string"),
        ("col1", "string", "col1", "string"),
        ("col2", "string", "col2", "string"),
        ("col3", "string", "col3", "string"),
        ("date", "string", "date", "timestamp"),
        ("value", "string", "value", "double")])


# Write the result to S3 as Parquet
glueContext.write_dynamic_frame.from_options(frame = GlueDF, connection_type = "s3",
     connection_options = {"path": "s3://..."},
     format = "parquet",
     transformation_ctx = "datasink0")

But with this approach I run into the kind of memory problem described in AWS Glue - can't set spark.yarn.executor.memoryOverhead.

What would be a better / more efficient way of doing this split?
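
One direction I am considering is replacing the Python UDF with Spark's built-in from_json, which keeps the JSON parsing inside the JVM and avoids the Python serialization overhead that can contribute to executor memory pressure. A minimal sketch, assuming the column names above:

from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import DoubleType, MapType, StringType

# Explicit schema: dates as string keys, levels as double values
levels_schema = MapType(StringType(), DoubleType())

sparkDF2 = (sparkDF
    .withColumn("levels_map", from_json(col("levels"), levels_schema))
    .select("unique_id", "col1", "col2", "col3",
            explode(col("levels_map")).alias("date", "value"))
    .withColumn("date", col("date").cast("timestamp")))

With the types already cast here, the apply_mapping step above would only need to pass the columns through unchanged.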

0 Answers:

No answers yet.