How to create a unique ID for a specific column based on groups defined in a dictionary

Time: 2020-08-13 06:00:35

Tags: pyspark

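I want to create a unique ID based on a DataFrame column that contains several groups. I have defined an ID for each group in a dictionary. How can I add this ID to the DataFrame based on that dictionary?

Sample data and code are below: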

conf_1 = {
    'cat': {
        '1': ['A_10', 'A_13'],
        '2': ['B_8', 'B_4'],
        '3': ['A_11', 'A_13'],
    },
}

testlist = [
    {"cat": "A_10", "val": 10},
    {"cat": "A_13", "val": 11},
    {"cat": "B_8", "val": 12},
    {"cat": "B_4", "val": 14},
    {"cat": "A_11", "val": 9},
    {"cat": "A_13", "val": 16},
]

spark_df = spark.createDataFrame(testlist)

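from pyspark.sql import functions as f

# One filtered DataFrame per group, tagged with that group's ID from the dictionary.
# Note: 'A_13' is listed under groups '1' and '3', so its rows land in both.
df = None
for group_id, cats in conf_1['cat'].items():
    part = spark_df.filter(f.col('cat').isin(cats)).withColumn("id", f.lit(group_id))
    df = part if df is None else df.unionByName(part)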
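
An alternative to filtering once per group is to invert the dictionary into a category-to-ID lookup and attach the ID in a single pass. The following is only a sketch: the helper names `build_lookup` and `add_group_id` are made up for illustration, and because 'A_13' is listed under both group '1' and group '3', the sketch assumes the first listed group wins. The question does not say how that case should be resolved, so that rule may need to change.

```python
from itertools import chain

conf_1 = {
    'cat': {
        '1': ['A_10', 'A_13'],
        '2': ['B_8', 'B_4'],
        '3': ['A_11', 'A_13'],
    },
}


def build_lookup(groups):
    """Invert {group_id: [category, ...]} into {category: [group_id, ...]}."""
    lookup = {}
    for group_id, cats in groups.items():
        for cat in cats:
            lookup.setdefault(cat, []).append(group_id)
    return lookup


def add_group_id(spark_df, lookup):
    """Add an 'id' column by looking up each row's 'cat' in a literal map column."""
    # Imported here so build_lookup can be used without a Spark installation.
    from pyspark.sql import functions as F

    # Assumption: a category listed under several groups gets its first group ID.
    pairs = chain.from_iterable(
        (F.lit(cat), F.lit(ids[0])) for cat, ids in lookup.items()
    )
    mapping = F.create_map(*pairs)
    return spark_df.withColumn('id', mapping[F.col('cat')])


lookup = build_lookup(conf_1['cat'])
```

With the `spark_df` from the question, this could be applied as `add_group_id(spark_df, lookup).show()`. Keeping the full list of group IDs per category in `lookup` makes the overlap on 'A_13' visible before any ID is assigned.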

The final output should look like this:

[image of the expected output; not available]

0 answers:

No answers