Check whether a column's value lies within the range given by another (array) column in a DataFrame

Asked: 2019-05-27 11:56:05

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I have a DataFrame in which I need to compare some values and derive a summary from the result.

For example:

My DF:

CITY DAY MONTH TAG RANGE     VALUE  RANK
A    1    01    A   [50, 90]   55     1
A    2    02    B   [30, 40]   34     3
A    1    03    A   [05, 10]   15    20
A    1    04    B   [50, 60]   11    10 
A    1    05    B   [50, 60]   54    4 

For each row I have to check whether VALUE lies within RANGE. Here arr[0] is the lower bound and arr[1] is the upper bound.

I need to create a new DF:

NEW-DF

TAG  Positive  Negative
A     1          1
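B     2          1

  1. If VALUE lies within the given range and RANK < 5, I count the row as Positive.

  2. If VALUE is not within the given range, the row is Negative.

  3. If VALUE is within the given range but RANK > 5, the row is also counted as Negative.

Positive and Negative are simply the counts of rows, per TAG, that satisfy the respective conditions.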
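The three rules above can be sketched in plain Python before moving to Spark. This is only an illustration: the helper name `classify` is hypothetical, the bounds are assumed to be inclusive, and a rank of exactly 5 (which the rules leave unspecified) is assumed to count as Negative.

```python
from collections import Counter

# Sample rows from the question: (tag, (lower, upper), value, rank).
rows = [
    ("A", (50, 90), 55, 1),
    ("B", (30, 40), 34, 3),
    ("A", (5, 10), 15, 20),
    ("B", (50, 60), 11, 10),
    ("B", (50, 60), 54, 4),
]

def classify(bounds, value, rank):
    """Hypothetical helper: 'Positive' only if value is in range and rank < 5."""
    lower, upper = bounds
    in_range = lower <= value <= upper  # assumption: bounds are inclusive
    return "Positive" if in_range and rank < 5 else "Negative"

# Count (tag, label) pairs, then reshape into {tag: (positive, negative)}.
counts = Counter((tag, classify(bounds, value, rank)) for tag, bounds, value, rank in rows)
summary = {
    tag: (counts[(tag, "Positive")], counts[(tag, "Negative")])
    for tag in sorted({row[0] for row in rows})
}
print(summary)  # {'A': (1, 1), 'B': (2, 1)}
```

The result matches the NEW-DF table in the question: one Positive and one Negative for tag A, two Positive and one Negative for tag B.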

2 answers:

Answer 0 (score: 1)

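We can use element_at (available since Spark 2.4, with 1-based indexing for arrays) to pull out each bound, compare the bounds with the value in the same row together with the rank condition, and then group by tag and sum the results: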

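from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# element_at is 1-based for arrays: index 1 is the lower bound, index 2 the upper bound.
# A row is "in range" only if the value is strictly inside the bounds and rank < 5.
range_df = df.withColumn('in_range', (F.element_at('range', 1).cast(IntegerType()) < F.col('value')) & 
                                     (F.col('value') < F.element_at('range', 2).cast(IntegerType())) &
                                     (F.col('rank') < 5))

range_df.show()

# Casting the boolean to an integer lets sum() count the matching rows per tag.
grouped_df = range_df.groupby('tag').agg(F.sum(F.col('in_range').cast(IntegerType())).alias('total_positive'), 
                                         F.sum((~F.col('in_range')).cast(IntegerType())).alias('total_negative'))

grouped_df.show()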

Output:

+---+--------+-----+----+--------+
|tag|   range|value|rank|in_range|
+---+--------+-----+----+--------+
|  A|[50, 90]|   55|   1|    true|
|  B|[30, 40]|   34|   3|    true|
|  A|[05, 10]|   15|  20|   false|
|  B|[50, 60]|   11|  10|   false|
|  B|[50, 60]|   54|   4|    true|
+---+--------+-----+----+--------+

+---+--------------+--------------+
|tag|total_positive|total_negative|
+---+--------------+--------------+
|  B|             2|             1|
|  A|             1|             1|
+---+--------------+--------------+
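One detail worth checking against your own requirements: the condition in this answer uses strict comparisons, so a value equal to either bound is counted as negative. The plain-Python sketch below contrasts the two readings of "between"; the function names and the boundary case are hypothetical and not taken from the question's data.

```python
def in_range_strict(lower, value, upper):
    # Mirrors the answer's condition: lower < value < upper.
    return lower < value < upper

def in_range_inclusive(lower, value, upper):
    # Alternative reading of "between": lower <= value <= upper.
    return lower <= value <= upper

# Hypothetical boundary case: the value sits exactly on the lower bound.
strict_result = in_range_strict(50, 50, 90)
inclusive_result = in_range_inclusive(50, 50, 90)
print(strict_result, inclusive_result)  # False True
```

If inclusive bounds are what you need in PySpark, Column.between(lowerBound, upperBound) is inclusive on both ends and could replace the two strict comparisons.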

Answer 1 (score: 0)

You first have to process the range with a UDF:

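import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(("A","1","01","A","[50,90]","55","1")).toDF("city","day","month","tag","range","value","rank")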

+----+---+-----+---+-------+-----+----+
|city|day|month|tag|  range|value|rank|
+----+---+-----+---+-------+-----+----+
|   A|  1|   01|  A|[50,90]|   55|   1|
+----+---+-----+---+-------+-----+----+


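  // Returns "positive" when rank is at most 5 and value lies strictly inside the range.
  // The question leaves rank == 5 unspecified; this version treats it as eligible.
  def checkRange(range : String, rank : String, value : String) : String = {
    // Strip the brackets and parse both bounds as integers so the comparison is numeric.
    val bounds = range.dropRight(1).drop(1).split(",").map(_.trim.toInt)
    val v = value.trim.toInt
    if (rank.trim.toInt > 5){
      "negative"
    } else {
      if (v > bounds(0) && v < bounds(1)){
        "positive"
      } else {
        "negative"
      }
    }
  }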

  val checkRangeUdf = udf(checkRange _)

df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).show()

+----+---+-----+---+-------+-----+----+--------+
|city|day|month|tag|  range|value|rank|  Result|
+----+---+-----+---+-------+-----+----+--------+
|   A|  1|   01|  A|[50,90]|   55|   1|positive|
+----+---+-----+---+-------+-----+----+--------+


// Group by "tag" instead of "city" to get the per-tag counts the question asks for.
val result = df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).groupBy("city","Result").count
result.show()

+----+--------+-----+
|city|  Result|count|
+----+--------+-----+
|   A|positive|    1|
+----+--------+-----+
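Because every column in this answer's DataFrame is a string, any comparison made on the raw strings is lexicographic rather than numeric. The short Python sketch below illustrates the difference and one way to parse a bracketed range string into integers first; `parse_range` is a hypothetical helper, and the sample values are made up to show the edge case.

```python
def parse_range(text):
    """Hypothetical helper: turn a string such as "[05, 10]" into [5, 10]."""
    return [int(part) for part in text.strip("[]").split(",")]

# Values as they would arrive from string-typed columns.
value, lower, upper = "9", "50", "90"

# Lexicographic comparison: "9" sorts after "50" because '9' > '5'.
string_result = lower < value < upper

# Numeric comparison after conversion gives the intended answer.
numeric_result = int(lower) < int(value) < int(upper)

print(string_result, numeric_result)  # True False
print(parse_range("[05, 10]"))        # [5, 10]
```

Converting the bounds and the value to integers before comparing avoids this class of false positives.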