Using `except` on a DataFrame in Apache Spark 2.1.0

Date: 2017-04-19 00:08:10

Tags: scala apache-spark dataframe

Does `except` work correctly on Spark DataFrames?

In the Spark shell, I created a simple DataFrame containing three strings: "a", "b", "c". The result of `limit(1)` is assigned to `row1`, which correctly produces Array([a]). Then `row1` is passed as the argument to the `except` method on the `grfDF` DataFrame, producing `tail1`. Shouldn't `tail1` be a new DataFrame containing Array([b], [c])?

Why does `tail1` still contain "a" but have "b" removed?

scala> grfDF.collect
res1: Array[org.apache.spark.sql.Row] = Array([a], [b], [c])                   

scala> val row1 = grfDF.limit(1)
row1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [sub: string]

scala> row1.collect
res3: Array[org.apache.spark.sql.Row] = Array([a])

scala> val tail1 = grfDF.except(row1).collect
tail1: Array[org.apache.spark.sql.Row] = Array([c], [a])

The DataFrame was created as follows:

    case class Grf(sub: String)
    def toGrf = (grf: Seq[String]) => Grf(grf(0))
    val sourceList = Array("a", "b", "c")
    val grfRDD = sc.parallelize(sourceList).map(_.split(",")).map(toGrf(_))
    val grfDF = spark.createDataFrame(grfRDD)
    grfDF.createOrReplaceTempView("grf")

Then I tried to pop one row off:

    val row1 = grfDF.limit(1)
    row1.collect 
    val tail1 = grfDF.except(row1)
    tail1.collect
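For reference, Spark's `except` follows SQL EXCEPT DISTINCT semantics: it returns the distinct rows of the left DataFrame that do not appear in the right one. Here is a minimal sketch of those semantics using plain Scala collections (this is an illustration of the set-difference behavior, not Spark itself; the helper name `exceptDistinct` is made up for this sketch):

```scala
// Sketch of DataFrame.except semantics with plain Scala collections.
// Spark's except behaves like SQL EXCEPT DISTINCT: the distinct
// elements of `left` that do not occur in `right`.
def exceptDistinct[A](left: Seq[A], right: Seq[A]): Seq[A] =
  left.distinct.filterNot(right.toSet)

val grf = Seq("a", "b", "c")
val row1 = grf.take(1)              // a deterministic "limit(1)" on an ordered collection
val tail1 = exceptDistinct(grf, row1)
println(tail1)                      // List(b, c)
```

Note that unlike `take(1)` on an ordered Scala collection, `limit(1)` on a DataFrame with no `orderBy` is not guaranteed to select the same row each time its plan is evaluated, which may explain why the rows removed by `except` varied in the transcript above.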

1 answer:

Answer 0 (score: 0)

I tried something similar in the Spark shell. Please try the same code again, because the result I get is Array([b], [c]). See the code below:

scala> val sourceList=Array("a","b","c")
sourceList: Array[String] = Array(a, b, c)

scala> val grfRDD = sc.parallelize(sourceList)
grfRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:29

scala> val grfDF = grfRDD.toDF()
grfDF: org.apache.spark.sql.DataFrame = [_1: string]

scala> grfDF
res0: org.apache.spark.sql.DataFrame = [_1: string]

scala> val row1 = grfDF.limit(1)
row1: org.apache.spark.sql.DataFrame = [_1: string]

scala> row1
res1: org.apache.spark.sql.DataFrame = [_1: string]

scala> row1.collect()
res2: Array[org.apache.spark.sql.Row] = Array([a])

scala> val tail = grfDF.except(row1)
tail: org.apache.spark.sql.DataFrame = [_1: string]

scala> tail.collect()
res6: Array[org.apache.spark.sql.Row] = Array([b], [c])