I do the following:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SameInterest").getOrCreate()
import spark.implicits._  // must come after spark is defined

val d1 = spark.read.json("/path/data1").select("Name", "Interest").createOrReplaceTempView("d1_sql")
val d2 = spark.read.json("/path/data2").select("Name", "Interest").createOrReplaceTempView("d2_sql")
val sql_script = "SELECT d1_sql.Name as Name, d1_sql.Interest as Interest1, d2_sql.Interest as Interest2 FROM d1_sql, d2_sql WHERE d1_sql.Name = d2_sql.Name"
val dosql = spark.sql(sql_script)
// the problematic line: X(1) and X(2) are the whole struct values, not the interest lists
val sameIP_UU = dosql.rdd.filter(X => Array(X(1)).intersect(Array(X(2))).length > 0)
I want to intersect the Interest columns of d1 and d2, but I can't get the correct answer.
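A minimal plain-Scala illustration (my own toy values, not the real data) of what goes wrong: wrapping each whole column value in a one-element Array makes intersect compare the two collections as single values, so nothing short of full equality matches.

Array(Seq("110", "220", "333")).intersect(Array(Seq("111", "222", "333")))  // Array(): the wrapped values differ as wholes
Seq("110", "220", "333").intersect(Seq("111", "222", "333"))                // Seq("333"): element-wise intersect finds the overlap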
The data and schema are:
{"name":"John","Interest1":{"bag_0":[{"Interest":"110"},{"Interest":"220"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}
{"name":"Allen","Interest1":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]},"Interest2":{"bag_0":[{"Interest":"111"},{"Interest":"222"},{"Interest":"333"}]}}
printSchema():
|-- Name: string (nullable = true)
|-- Interest1: struct (nullable = true)
| |-- bag_0: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Interest: string (nullable = true)
|-- Interest2: struct (nullable = true)
| |-- bag_0: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Interest: string (nullable = true)
I think the answer should be 2 (John and Allen each share at least one interest, e.g. "333"), but I always get 1: the whole-struct comparison only matches when every interest is identical, which is true only for Allen.
I found that the data comes back as a WrappedArray: [WrappedArray([110],[220],[333])]
This is probably why I get the wrong answer, but I don't know how to extract the values from the WrappedArray and intersect them.
Edit:
dosql.take(1)
res47: Array[org.apache.spark.sql.Row] = Array([John,[WrappedArray([110], [220], [333])],[WrappedArray([110], [220], [333])]])
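Reading that output: the outer brackets are the Interest1 struct, the WrappedArray is its bag_0 field, and each [110] is a one-field Row. A sketch of unwrapping one collected Row by hand (positional indices follow the take(1) output above):

import org.apache.spark.sql.Row
val first = dosql.take(1).head
val bag1 = first.getStruct(1).getSeq[Row](0)  // the WrappedArray([110], [220], [333])
bag1.map(_.getString(0))                      // Seq("110", "220", "333")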
Answer (score: 0):
It should be:
import org.apache.spark.sql.Row

dosql.rdd.filter { row =>
  // each Interest column is a struct wrapping a bag_0 array of one-field structs,
  // so unwrap both levels down to the plain strings before intersecting
  val interests1 = row.getAs[Row]("Interest1").getAs[Seq[Row]]("bag_0").map(_.getString(0))
  val interests2 = row.getAs[Row]("Interest2").getAs[Seq[Row]]("bag_0").map(_.getString(0))
  interests1.intersect(interests2).nonEmpty
}
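For completeness, a DataFrame-only sketch (my own alternative, assuming Spark 2.4+ where array_intersect is available): projecting the nested field as "Interest1.bag_0.Interest" yields a plain array<string> column, which the built-in can intersect directly, with no round-trip through the RDD API.

import org.apache.spark.sql.functions.{array_intersect, col, size}

val shared = dosql.filter(
  size(array_intersect(col("Interest1.bag_0.Interest"),
                       col("Interest2.bag_0.Interest"))) > 0
)
shared.count()  // 2 for the sample data: John and Allen each share at least one interest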