I have a DataFrame and need to compare some of its values and derive counts from them.
For example, my DF:
CITY DAY MONTH TAG RANGE VALUE RANK
A 1 01 A [50, 90] 55 1
A 2 02 B [30, 40] 34 3
A 1 03 A [05, 10] 15 20
A 1 04 B [50, 60] 11 10
A 1 05 B [50, 60] 54 4
For each row, I have to check whether VALUE lies within RANGE. Here arr[0] is the lower bound and arr[1] is the upper bound.
I need to create a new DF:
NEW-DF
TAG Positive Negative
A 1 1
B 2 1
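If VALUE lies within the given range and RANK < 5, I count the row as Positive.
If VALUE is not within the given range, the row is Negative.
If VALUE is within the given range but RANK > 5, the row is also counted as Negative.
Positive and Negative are simply the counts of rows that satisfy these conditions.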
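Answer 0 (score: 1)
We can use element_at to get the element at each position of the array, compare those bounds with the value in each row together with the rank condition, and then group by tag and sum the results: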
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# element_at is 1-based: position 1 is the lower bound, position 2 the upper bound
range_df = df.withColumn(
    'in_range',
    (F.element_at('range', 1).cast(IntegerType()) < F.col('value'))
    & (F.col('value') < F.element_at('range', 2).cast(IntegerType()))
    & (F.col('rank') < 5))
range_df.show()

# Casting the boolean to an integer turns true into 1, so the sum is a row count per tag
grouped_df = range_df.groupby('tag').agg(
    F.sum(F.col('in_range').cast(IntegerType())).alias('total_positive'),
    F.sum((~F.col('in_range')).cast(IntegerType())).alias('total_negative'))
grouped_df.show()
Output:
+---+--------+-----+----+--------+
|tag| range|value|rank|in_range|
+---+--------+-----+----+--------+
| A|[50, 90]| 55| 1| true|
| B|[30, 40]| 34| 3| true|
| A|[05, 10]| 15| 20| false|
| B|[50, 60]| 11| 10| false|
| B|[50, 60]| 54| 4| true|
+---+--------+-----+----+--------+
+---+--------------+--------------+
|tag|total_positive|total_negative|
+---+--------------+--------------+
| B| 2| 1|
| A| 1| 1|
+---+--------------+--------------+
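As a cross-check on these counts, the same rule can be sketched in plain Python without Spark. The helper name `is_positive` and the hard-coded rows are illustrative only, and treating a rank of exactly 5 as negative is an assumption, since the question does not cover that case.

```python
from collections import Counter

# Sample rows from the question: (tag, (low, high), value, rank)
rows = [
    ("A", (50, 90), 55, 1),
    ("B", (30, 40), 34, 3),
    ("A", (5, 10), 15, 20),
    ("B", (50, 60), 11, 10),
    ("B", (50, 60), 54, 4),
]

def is_positive(bounds, value, rank):
    """True when value lies strictly inside bounds and rank < 5 (hypothetical helper)."""
    low, high = bounds
    return low < value < high and rank < 5

positive = Counter()
negative = Counter()
for tag, bounds, value, rank in rows:
    if is_positive(bounds, value, rank):
        positive[tag] += 1
    else:
        negative[tag] += 1

# Print one line per tag: tag, positive count, negative count
for tag in sorted(set(positive) | set(negative)):
    print(tag, positive[tag], negative[tag])
```

This prints `A 1 1` and `B 2 1`, matching the NEW-DF in the question.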
Answer 1 (score: 0)
You first need to process the range with a UDF:
val df = Seq(("A","1","01","A","[50,90]","55","1")).toDF("city","day","month","tag","range","value","rank")
+----+---+-----+---+-------+-----+----+
|city|day|month|tag| range|value|rank|
+----+---+-----+---+-------+-----+----+
| A| 1| 01| A|[50,90]| 55| 1|
+----+---+-----+---+-------+-----+----+
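import org.apache.spark.sql.functions.{col, udf}

def checkRange(range: String, rank: String, value: String): String = {
  // Strip the brackets, then parse both bounds as integers so the comparison is numeric
  val bounds = range.dropRight(1).drop(1).split(",").map(_.trim.toInt)
  val v = value.trim.toInt
  // Note: a rank of exactly 5 passes this check; the question does not say how to treat it
  if (rank.trim.toInt > 5) {
    "negative"
  } else if (v > bounds(0) && v < bounds(1)) {
    "positive"
  } else {
    "negative"
  }
}
val checkRangeUdf = udf(checkRange _)
df.withColumn("Result", checkRangeUdf(col("range"), col("rank"), col("value"))).show()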
+----+---+-----+---+-------+-----+----+--------+
|city|day|month|tag| range|value|rank| Result|
+----+---+-----+---+-------+-----+----+--------+
| A| 1| 01| A|[50,90]| 55| 1|positive|
+----+---+-----+---+-------+-----+----+--------+
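// show() returns Unit, so keep the aggregated DataFrame in `result` and display it separately
val result = df.withColumn("Result", checkRangeUdf(col("range"), col("rank"), col("value"))).groupBy("city", "Result").count
result.show()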
+----+--------+-----+
|city| Result|count|
+----+--------+-----+
| A|positive| 1|
+----+--------+-----+
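One detail worth checking in a UDF like this is that the bounds and the value arrive as strings, so they should be converted to numbers before being compared, because string comparison is lexicographic. A minimal Python sketch of the parsing step, assuming the range is stored as text such as "[05, 10]" (the function name `parse_range` is hypothetical):

```python
def parse_range(text):
    """Parse a string such as "[05, 10]" into an (int, int) tuple (hypothetical helper)."""
    low, high = text.strip().strip("[]").split(",")
    # int() accepts surrounding whitespace and leading zeros in decimal strings
    return int(low), int(high)

low, high = parse_range("[05, 10]")
print(low, high)              # 5 10

# Comparing as strings is lexicographic and gives the wrong answer here
print("9" < "10")             # False
print(int("9") < int("10"))   # True
```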