Very specific requirement for outlier treatment in a Spark DataFrame

Date: 2018-02-06 10:52:27

Tags: scala apache-spark apache-spark-sql

I have a very specific requirement for outlier treatment in a Spark DataFrame (Scala): I want to treat only the first (largest) outlier group and make its count equal to that of the second group.

Input:

+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A     |r1               |  da  |   
|A     |r1               |  ds  |
|A     |r1               |  s   |
|A     |r1               |  f   |
|A     |r1               |  v   |
|A     |r2               |  s   |
|A     |r2               |  s   |
|A     |r2               |  c   |
|A     |r3               |  s   |
|A     |r3               |  s   |
|A     |r4               |  s   |
|A     |r5               |  c   |
|A     |r6               |  s   |
|A     |r7               |  s   |
|A     |r8               |  s   |
+------+-----------------+------+
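
For reference, here is a minimal sketch that rebuilds this sample DataFrame (assuming an active SparkSession named spark, e.g. in spark-shell):

import spark.implicits._

// Reconstruction of the sample data shown above
val df = Seq(
  ("A", "r1", "da"), ("A", "r1", "ds"), ("A", "r1", "s"), ("A", "r1", "f"), ("A", "r1", "v"),
  ("A", "r2", "s"), ("A", "r2", "s"), ("A", "r2", "c"),
  ("A", "r3", "s"), ("A", "r3", "s"),
  ("A", "r4", "s"), ("A", "r5", "c"), ("A", "r6", "s"), ("A", "r7", "s"), ("A", "r8", "s")
).toDF("market", "responseVariable", "blabla")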

Now, for each market and responseVariable, I want to treat only the first outlier.

Group counts per market and responseVariable:

+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A     |r1               |  5   |   
|A     |r2               |  3   |
|A     |r3               |  2   |
|A     |r4               |  1   |
|A     |r5               |  1   |
|A     |r6               |  1   |
|A     |r7               |  1   |
|A     |r8               |  1   |
+------+-----------------+------+
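
These counts can be obtained with a simple aggregation, for example:

df.groupBy("market", "responseVariable").count().orderBy("responseVariable").show()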

In the actual dataset, I want to treat the outlier for the group market = A, responseVariable = r1: randomly remove records from this first group until its count equals that of the second group (r2).

Expected output:

+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A     |r1               |  da  |   
|A     |r1               |  s   |
|A     |r1               |  v   |
|A     |r2               |  s   |
|A     |r2               |  s   |
|A     |r2               |  c   |
|A     |r3               |  s   |
|A     |r3               |  s   |
|A     |r4               |  s   |
|A     |r5               |  c   |
|A     |r6               |  s   |
|A     |r7               |  s   |
|A     |r8               |  s   |
+------+-----------------+------+

Resulting group counts:

+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A     |r1               |  3   |   
|A     |r2               |  3   |
|A     |r3               |  2   |
|A     |r4               |  1   |
|A     |r5               |  1   |
|A     |r6               |  1   |
|A     |r7               |  1   |
|A     |r8               |  1   |
+------+-----------------+------+

I want to repeat this across multiple markets.

1 answer:

Answer 0 (score: 1):

You first have to know the name of the largest group and the count of the second-largest group, which can be done as follows:

import org.apache.spark.sql.functions._

// (responseVariable, count) pairs of the two largest groups, largest first
val first_two_values = df.groupBy("market", "responseVariable")
  .agg(count("blabla").as("count"))
  .orderBy($"count".desc)
  .take(2)
  .map(row => (row.getString(1), row.getLong(2)))
  .toList
val rowsToFilter = first_two_values(0)._1   // responseVariable of the largest group
val countsToFilter = first_two_values(1)._2 // count of the second-largest group
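
For the sample data above, this resolves to rowsToFilter = "r1" (the largest group, with 5 rows) and countsToFilter = 3 (the size of the second-largest group, r2).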

Once you know the top two groups, you need to filter out the extra rows from the largest group. This can be done by generating a row_number within each group and dropping the rows beyond the target count, as shown below:

import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("market", "responseVariable").orderBy("blabla")

// Number the rows within each group, then drop the rows beyond the
// target count in the group being trimmed
df.withColumn("rank", row_number().over(windowSpec))
  .filter(!(col("rank") > countsToFilter && col("responseVariable") === rowsToFilter))
  .drop("rank")
  .show(false)
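
Note that ordering the window by "blabla" makes the removal deterministic (it keeps the first rows in "blabla" order). Since the question asks for random removal, one option is to order the window by rand() from org.apache.spark.sql.functions instead; the variable name below is just illustrative:

// Ordering by rand() removes a random subset of the extra rows
val randomWindow = Window.partitionBy("market", "responseVariable").orderBy(rand())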

This should satisfy your requirement.