Spark intersect with WrappedArray

Time: 2017-04-13 13:15:09

Tags: scala apache-spark

I run the following:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SameInterest").getOrCreate()

// spark.implicits._ can only be imported after `spark` has been created
import spark.implicits._

val d1 = spark.read.json ("/path/data1").select("Name","Interest").createOrReplaceTempView("d1_sql")
val d2 = spark.read.json ("/path/data2").select("Name","Interest").createOrReplaceTempView("d2_sql")

val sql_script = "SELECT d1_sql.Name as Name , d1_sql.Interest as Interest1 , d2_sql.Interest as Interest2 FROM d1_sql, d2_sql WHERE d1_sql.Name = d2_sql.Name"

val dosql = spark.sql(sql_script)

val sameIP_UU = dosql.rdd.filter(X => Array(X(1)).intersect(Array(X(2))).length>0)

I want to intersect the Interest columns of d1 and d2, but I cannot get the correct answer.

The data and schema are:

{"name":"John","Interest1":{"bag_0":[{"Interest":"110"},{"Interest":"220"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}
{"name":"Allen","Interest1":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}

printSchema():

  |-- Name: string (nullable = true)
  |-- Interest1: struct (nullable = true)
  |    |-- bag_0: array (nullable = true)
  |    |    |-- element: struct (containsNull = true)
  |    |    |    |-- Interest: string (nullable = true)
  |-- Interest2: struct (nullable = true)
  |    |-- bag_0: array (nullable = true)
  |    |    |-- element: struct (containsNull = true)
  |    |    |    |-- Interest: string (nullable = true)

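I think the answer should be 2, but I always get 1.

I found that the data structure contains a WrappedArray:     [WrappedArray([110],[220],[333])]

This may be why I get the wrong answer, but I don't know how to get the values out of the WrappedArray and use intersect on them.

Edit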

dosql.take(1)
res47: Array[org.apache.spark.sql.Row] = Array([John,[WrappedArray([110], [220], [333])],[WrappedArray([110], [220], [333])]])
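
A likely explanation for the count of 1 is that Array(X(1)) does not unpack the column value: it builds a new one-element array whose single element is the whole value. The intersect then compares the two whole values for equality, so only rows whose Interest lists are identical pass the filter. The following is a minimal sketch in plain Scala with no Spark dependency; the object name IntersectDemo and the two string sequences are hypothetical stand-ins for one row's columns, using the values from John's record.

```scala
object IntersectDemo {
  // Hypothetical stand-ins for the two array-valued columns of one row
  val interest1: Seq[String] = Seq("110", "220", "333")
  val interest2: Seq[String] = Seq("111", "222", "333")

  // What the filter in the question does: each whole sequence becomes
  // the single element of a new Array, so intersect compares the two
  // sequences for equality as a whole
  val wrapped: Array[Any] = Array[Any](interest1).intersect(Array[Any](interest2))

  // Intersecting the sequences themselves compares their elements
  val direct: Seq[String] = interest1.intersect(interest2)

  def main(args: Array[String]): Unit = {
    println(wrapped.length) // 0: the two sequences are not equal as a whole
    println(direct)         // List(333): the one shared element
  }
}
```

With Allen's record both sequences are identical, so the wrapped form yields one element, which would account for exactly one row surviving the filter.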

1 answer:

Answer 0: (score: 0)

It should be:

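import org.apache.spark.sql.Row

dosql.rdd.filter { row =>
  // Per the schema, each Interest column is a struct whose bag_0 field holds the array
  val interest1 = row.getAs[Row]("Interest1").getAs[Seq[Row]]("bag_0")
  val interest2 = row.getAs[Row]("Interest2").getAs[Seq[Row]]("bag_0")
  // Compare Interest1 against Interest2, element by element
  interest1.intersect(interest2).nonEmpty
}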
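
The same filter can probably also be written without dropping to the RDD API. The sketch below assumes Spark 2.4 or later, where array_intersect is available in org.apache.spark.sql.functions (it did not exist yet when this question was asked), and assumes the schema shown in the question. The object name SameInterestDemo, the helper withSharedInterest, and the local[*] master are hypothetical choices for illustration; the sample rows are the two records from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_intersect, col, size}

object SameInterestDemo {
  // Hypothetical helper: keep rows whose two Interest arrays share a value.
  // Selecting a field through an array of structs yields an array of that field.
  def withSharedInterest(df: DataFrame): DataFrame = {
    val i1 = col("Interest1.bag_0.Interest")
    val i2 = col("Interest2.bag_0.Interest")
    df.filter(size(array_intersect(i1, i2)) > 0)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster
    val spark = SparkSession.builder().appName("SameInterest").master("local[*]").getOrCreate()
    import spark.implicits._

    // The two sample records from the question
    val json = Seq(
      """{"name":"John","Interest1":{"bag_0":[{"Interest":"110"},{"Interest":"220"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}""",
      """{"name":"Allen","Interest1":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}"""
    ).toDS()

    val df = spark.read.json(json)
    withSharedInterest(df).show(false)
    spark.stop()
  }
}
```

Staying in the DataFrame API keeps the comparison on the string values themselves rather than on Row objects, and lets Spark optimize the filter; on older Spark versions the RDD-based filter above remains the option.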