Iterating over a DataFrame

Date: 2017-01-19 10:14:38

Tags: scala apache-spark apache-spark-sql

There is a DataFrame:

import sqlContext.implicits._

case class TestData(banana: String, orange: String, apple : String, feijoa: String)

var data = sc.parallelize((1 to 5).map(i => TestData("banana="+i.toString,
                    "orange="+i.toString,"apple="+i.toString,"feijoa="+i.toString))).toDF

data.registerTempTable("data")
data.show 

It looks like this:

+--------+--------+-------+--------+
|  banana|  orange|  apple|  feijoa|
+--------+--------+-------+--------+
|banana=1|orange=1|apple=1|feijoa=1|
|banana=2|orange=2|apple=2|feijoa=2|
|banana=3|orange=3|apple=3|feijoa=3|
|banana=4|orange=4|apple=4|feijoa=4|
|banana=5|orange=5|apple=5|feijoa=5|
+--------+--------+-------+--------+

In addition, there is a sorted list of results:

case class result(fruits: Set[String], weight: Double)

val results = List(
  result(Set("banana=1"), 200),
  result(Set("banana=3", "orange=3"), 180),
  result(Set("banana=2", "orange=2", "apple=3"), 170)
)

I would like to iterate over results, compare each result.fruits with the rows of the DataFrame, and set 1 in the appropriate column if the row contains that particular result.

Update: each column of the DataFrame holds only one value per row, for example banana=1. The result.fruits sets are composed of these values.

1) I know how to iterate over results:

(0 to results.size-1).map(i => results(i).fruits)

2) I know how to add columns to the DataFrame, one per element of results:

data = (1 to results.size)
  .par
  .foldLeft(data){ case(data,i) => data.withColumn(i.toString(),lit(0)) }

+--------+--------+-------+--------+-+-+-+
|  banana|  orange|  apple|  feijoa|1|2|3|
+--------+--------+-------+--------+-+-+-+
|banana=1|orange=1|apple=1|feijoa=1|0|0|0|
|banana=2|orange=2|apple=2|feijoa=2|0|0|0|
|banana=3|orange=3|apple=3|feijoa=3|0|0|0|
|banana=4|orange=4|apple=4|feijoa=4|0|0|0|
|banana=5|orange=5|apple=5|feijoa=5|0|0|0|
+--------+--------+-------+--------+-+-+-+

3) I need help understanding how to combine a check of whether a particular row contains result.fruits with the select function, and then set the value 1 in the appropriate column: column 1 for the first element of the results list, column 2 for the second, and so on.

1 answer:

Answer 0 (score: 1)

Try something like this (this is a simple solution, but you can generalize it):

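import org.apache.spark.sql.functions.{array, udf}

// Collect the four fruit columns into a single array column.
data = data.withColumn("combined", array($"banana", $"orange", $"apple", $"feijoa"))

// Build a UDF that returns 1 when the row's array contains every value in resultSet.
def getFunc(resultSet: Set[String]) = {
    def f(x: Seq[String]): Int = {
        if (resultSet.forall(y => x.contains(y))) 1 else 0
    }
    udf(f _)
}

// Add one indicator column per result, named "1", "2", ... in list order.
data = (1 to results.size).foldLeft(data){
  (x, i) => x.withColumn(i.toString, getFunc(results(i-1).fruits)($"combined"))
}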
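
To see what the UDF above computes for the sample data, the same per-row check can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the names Result, ContainmentSketch, rows, flag and flags are hypothetical and do not appear in the original post, and each row stands in for the "combined" array column.

```scala
// Plain-Scala sketch (no Spark) of the per-row containment check.
// Result, rows, flag and flags are illustrative names, not from the original post.
final case class Result(fruits: Set[String], weight: Double)

object ContainmentSketch {
  // Same sample results as in the question.
  val results: List[Result] = List(
    Result(Set("banana=1"), 200),
    Result(Set("banana=3", "orange=3"), 180),
    Result(Set("banana=2", "orange=2", "apple=3"), 170)
  )

  // Each row mirrors the "combined" array column: one value per fruit column.
  val rows: Seq[Seq[String]] = (1 to 5).map { i =>
    Seq(s"banana=$i", s"orange=$i", s"apple=$i", s"feijoa=$i")
  }

  // 1 if every value in the result's set occurs in the row, else 0.
  def flag(row: Seq[String], fruits: Set[String]): Int =
    if (fruits.subsetOf(row.toSet)) 1 else 0

  // One flag per result for every row, in list order.
  val flags: Seq[Seq[Int]] =
    rows.map(row => results.map(r => flag(row, r.fruits)))

  def main(args: Array[String]): Unit =
    rows.zip(flags).foreach { case (row, fs) =>
      println(s"${row.mkString(",")} -> ${fs.mkString(",")}")
    }
}
```

With this sample data only the first row matches the first result and only the third row matches the second result. The third result matches no row, because banana=2 and orange=2 occur in row 2 while apple=3 occurs in row 3.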