PySpark - Get a value from an array using minimum and maximum ranges

Asked: 2019-05-15 13:21:41

Tags: json pyspark databricks azure-databricks

I am trying to write a query in PySpark that will pick the correct value from an array.

For example, I have a dataframe called df with three columns: "companyId", "companySize" and "weightingRange". The "companySize" column is simply the number of employees. The "weightingRange" column is an array containing the following:

[ {"minimum":0, "maximum":100, "weight":123},
  {"minimum":101, "maximum":200, "weight":456},
  {"minimum":201, "maximum":500, "weight":789}
]

So the dataframe looks like this (weightingRange is the same as above; it is truncated in the example below to show the format more clearly):

+-----------+-------------+------------------------+
| companyId | companySize |     weightingRange     |
+-----------+-------------+------------------------+
| ABC1      |         150 | [{"maximum":100, etc}] |
| ABC2      |          50 | [{"maximum":100, etc}] |
+-----------+-------------+------------------------+
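In Spark terms, weightingRange would be an array<struct<minimum:int, maximum:int, weight:int>> column. If it actually arrives as a JSON string (per the json tag), it could first be parsed into that shape with from_json; a minimal sketch, assuming those field names:

from pyspark.sql import functions as F, types as T

# Assumed schema for the weightingRange JSON: an array of {minimum, maximum, weight} structs
range_schema = T.ArrayType(T.StructType([
    T.StructField("minimum", T.IntegerType()),
    T.StructField("maximum", T.IntegerType()),
    T.StructField("weight", T.IntegerType()),
]))

# Only needed if weightingRange is stored as a JSON string rather than a real array column
df = df.withColumn("weightingRange", F.from_json(F.col("weightingRange"), range_schema))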

For the entry with companySize = 150, I need to return the weight 456 into a column called "companyWeighting".

So it should show the following:

+-----------+-------------+------------------------+------------------+
| companyId | companySize |     weightingRange     | companyWeighting |
+-----------+-------------+------------------------+------------------+
| ABC1      |         150 | [{"maximum":100, etc}] |              456 |
| ABC2      |          50 | [{"maximum":100, etc}] |              123 |
+-----------+-------------+------------------------+------------------+

I have looked at

df.withColumn("tmp",explode(col("weightingRange"))).select("tmp.*")

and then joining, but trying to apply that would produce a Cartesian product of the data.
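For reference, the exploded rows could also be filtered in place rather than joined back, which sidesteps the Cartesian product; a rough sketch, assuming weightingRange is an array of structs with fields minimum, maximum and weight:

import pyspark.sql.functions as F

# One row per (company, range); keep only the range that contains companySize, no join needed
exploded = df.withColumn("range", F.explode(F.col("weightingRange")))
matched = (exploded
           .filter((F.col("companySize") >= F.col("range.minimum")) &
                   (F.col("companySize") <= F.col("range.maximum")))
           .withColumn("companyWeighting", F.col("range.weight"))
           .drop("range"))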

Suggestions appreciated!

1 Answer:

Answer 0 (score: 1)

You can approach it like this.

First, create a sample dataframe:

import pyspark.sql.functions as F

df = spark.createDataFrame([
        ('ABC1', 150, [{"minimum": 0,   "maximum": 100, "weight": 123},
                       {"minimum": 101, "maximum": 200, "weight": 456},
                       {"minimum": 201, "maximum": 500, "weight": 789}]),
        ('ABC2', 50,  [{"minimum": 0,   "maximum": 100, "weight": 123},
                       {"minimum": 101, "maximum": 200, "weight": 456},
                       {"minimum": 201, "maximum": 500, "weight": 789}])],
        ['companyId', 'companySize', 'weightingRange'])
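Note that passing plain Python dicts makes Spark infer weightingRange as a map column, which is why the output further down renders as Map(weight -> ...). If a typed array of structs is preferred, an explicit schema could be passed instead; a sketch using the field names from the question and tuples in place of dicts:

from pyspark.sql import types as T

struct_schema = T.StructType([
    T.StructField('companyId', T.StringType()),
    T.StructField('companySize', T.IntegerType()),
    T.StructField('weightingRange', T.ArrayType(T.StructType([
        T.StructField('minimum', T.IntegerType()),
        T.StructField('maximum', T.IntegerType()),
        T.StructField('weight', T.IntegerType()),
    ])))
])

# Same rows as above, but weightingRange becomes array<struct> rather than map
df_typed = spark.createDataFrame([
    ('ABC1', 150, [(0, 100, 123), (101, 200, 456), (201, 500, 789)]),
    ('ABC2', 50,  [(0, 100, 123), (101, 200, 456), (201, 500, 789)])
], struct_schema)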

Then create a UDF and apply it to each row to get the new column:

from pyspark.sql.types import IntegerType

def get_weight(wt, wt_rnge):
    # Return the weight of the first range whose [minimum, maximum] interval contains wt
    for _d in wt_rnge:
        if _d['minimum'] <= wt <= _d['maximum']:
            return _d['weight']

# Giving the UDF an explicit return type keeps companyWeighting as an integer rather than a string
get_weight_udf = F.udf(get_weight, IntegerType())
df = df.withColumn('companyWeighting', get_weight_udf(F.col('companySize'), F.col('weightingRange')))
df.show()

You get the output:

+---------+-----------+--------------------+----------------+
|companyId|companySize|      weightingRange|companyWeighting|
+---------+-----------+--------------------+----------------+
|     ABC1|        150|[Map(weight -> 12...|             456|
|     ABC2|         50|[Map(weight -> 12...|             123|
+---------+-----------+--------------------+----------------+
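On Spark 2.4+, the same lookup can also be done without a UDF by using the built-in filter higher-order function on the array; a sketch, assuming weightingRange is an array of structs with fields minimum, maximum and weight (not a map):

import pyspark.sql.functions as F

# Keep only the range whose interval contains companySize, then take its weight
df = df.withColumn(
    'companyWeighting',
    F.expr("filter(weightingRange, r -> companySize >= r.minimum and companySize <= r.maximum)[0].weight")
)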