How to loop one dataframe over another dataframe and get a single matching record in PySpark

Asked: 2020-07-02 03:01:35

Tags: python apache-spark pyspark

**Dataframe 1**

 +----+--------+------------+
 |key |dc_count|dc_day_count|
 +----+--------+------------+
 |123 |13      |66          |
 |123 |13      |12          |
 +----+--------+------------+

**Rules dataframe**

 +----+-------------+--------------+--------+
 |key |rule_dc_count|rule_day_count|rule_out|
 +----+-------------+--------------+--------+
 |123 |2            |30            |139     |
 |123 |null         |null          |64      |
 |124 |2            |30            |139     |
 |124 |null         |null          |64      |
 +----+-------------+--------------+--------+

If dc_count > rule_dc_count and dc_day_count > rule_day_count, take that row's rule_out;
otherwise take the other rule_out (the fallback row whose thresholds are null).

Expected output

 +----+--------+
 |key |rule_out|
 +----+--------+
 |123 |139     |
 |124 |64      |
 +----+--------+
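To make the matching logic concrete, the selection rule can be sketched in plain Python (a hypothetical helper, not part of either answer; the row with null thresholds acts as the fallback):

```python
def pick_rule_out(dc_count, dc_day_count, rules):
    """rules: list of (rule_dc_count, rule_day_count, rule_out) tuples
    for one key; rows with None thresholds are the fallback."""
    for rule_dc, rule_day, out in rules:
        # a row matches only when both thresholds are exceeded
        if rule_dc is not None and dc_count > rule_dc and dc_day_count > rule_day:
            return out
    # otherwise fall back to the row with null thresholds
    return next(out for rule_dc, _, out in rules if rule_dc is None)

rules_for_key = [(2, 30, 139), (None, None, 64)]
print(pick_rule_out(13, 66, rules_for_key))  # 139: both thresholds exceeded
print(pick_rule_out(13, 12, rules_for_key))  # 64: day count 12 <= 30, fallback
```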

2 Answers:

Answer 0 (score: 0)

Assuming the expected output is -

+---+--------+
|key|rule_out|
+---+--------+
|123|139     |
+---+--------+

The query below should work -

spark.sql(
      """
        |SELECT
        | t1.key, t2.rule_out
        |FROM table1 t1 join table2 t2 on t1.key=t2.key and
        |t1.dc_count > t2.rule_dc_count and t1.dc_day_count > t2.rule_day_count
      """.stripMargin)
      .show(false)
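The join semantics of that query can be sketched in plain Python over lists of dicts (hypothetical sample data, only to show which rows survive the inner join):

```python
table1 = [{"key": 123, "dc_count": 13, "dc_day_count": 66}]
table2 = [
    {"key": 123, "rule_dc_count": 2, "rule_day_count": 30, "rule_out": 139},
    {"key": 123, "rule_dc_count": None, "rule_day_count": None, "rule_out": 64},
]

# inner join on key plus both threshold conditions; None thresholds never
# match, mirroring SQL's null comparison semantics
joined = [
    {"key": t1["key"], "rule_out": t2["rule_out"]}
    for t1 in table1
    for t2 in table2
    if t1["key"] == t2["key"]
    and t2["rule_dc_count"] is not None
    and t1["dc_count"] > t2["rule_dc_count"]
    and t1["dc_day_count"] > t2["rule_day_count"]
]
print(joined)  # [{'key': 123, 'rule_out': 139}]
```

Note that rows failing the thresholds are simply dropped, so this approach alone never emits the fallback rule_out of 64.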

Answer 1 (score: 0)

PySpark version

The challenge here is fetching the second row's value for a key into the same column; the LEAD() analytic function can be used to solve this.

Create the dataframes here:

from pyspark.sql import functions as F
df = spark.createDataFrame([(123,13,66),(124,13,12)],[ "key","dc_count","dc_day_count"])
df1 = spark.createDataFrame([(123,2,30,139),(123,0,0,64),(124,2,30,139),(124,0,0,64)],
                            ["key","rule_dc_count","rule_day_count","rule_out"])

Logic to get the required result:

from pyspark.sql import Window as W
_w = W.partitionBy('key').orderBy(F.col('key').desc())
# rn = the next row's rule_out within each key, i.e. the fallback value
df1 = df1.withColumn('rn', F.lead('rule_out').over(_w))
df1 = df1.join(df, 'key', 'left')
# take rule_out when both thresholds are exceeded, otherwise the fallback (rn)
df1 = df1.withColumn('condition_col',
                     F.when(
                         (F.col('dc_count') > F.col('rule_dc_count')) &
                         (F.col('dc_day_count') > F.col('rule_day_count')),
                         F.col('rule_out'))
                     .otherwise(F.col('rn')))

# keep only the first row per key (the second row has rn = null)
df1 = df1.filter(F.col('rn').isNotNull())
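Spark's lead() used above returns the value of the row `offset` positions ahead within the window, or null past the end. In plain Python the effect looks like this (a simplified sketch that ignores window ordering):

```python
def lead(values, offset=1):
    # next row's value within the partition; None past the end,
    # like Spark's lead() default
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

# for key 123 the rule_out column is [139, 64]
print(lead([139, 64]))  # [64, None]
```

This is why filtering on `rn IS NOT NULL` keeps exactly one row per key: only the first row of each partition has a non-null lead value.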

Output

df1.show()
+---+-------------+--------------+--------+---+--------+------------+-------------+
|key|rule_dc_count|rule_day_count|rule_out| rn|dc_count|dc_day_count|condition_col|
+---+-------------+--------------+--------+---+--------+------------+-------------+
|124|            2|            30|     139| 64|      13|          12|           64|
|123|            2|            30|     139| 64|      13|          66|          139|
+---+-------------+--------------+--------+---+--------+------------+-------------+