I want to obtain a Dataset with the following content:
111, Array([123,1], [222,3])
222, Array([333,3], [444,3])
Here is my code, on Spark 2.2.0 with Scala 2.11:
val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .config("spark.sql.warehouse.dir", inputPath)
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.parquet(inputPath)
df.createOrReplaceTempView("sample_data")
val rows = spark.sql("SELECT * FROM sample_data")

val result = rows.map { row: Row =>
  val pk = row.get(row.fieldIndex("pk")).toString.toLong
  val r = spark.sql("SELECT pk FROM sample_data WHERE pk != " + pk)
  val productList = r.rdd.map(r => r(0).toString.toLong).collect()
  (row.get(row.fieldIndex("pk")).toString.toLong, productList)
}
But I get this error:
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[ERROR] val result = rows.map{ row: Row => {
I tried importing sqlContext.implicits._, but it doesn't compile.
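In Spark 2.x the encoder implicits live on the SparkSession instance itself rather than on a SQLContext value, so the import apparently has to go through the concrete session in scope; a minimal sketch:

val spark = SparkSession.builder().getOrCreate()
// Import from the session instance itself: this brings the encoders
// for primitives, tuples and case classes into scope.
import spark.implicits._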
In Maven I have this dependency:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
UPDATE
In the end I imported the implicits as follows: import spark.implicits._ — but now I get this error at runtime:
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:128)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at org.test.Compute$$anonfun$1.apply(ComputeNumSim.scala:68)
at org.test.Compute$$anonfun$1.apply(ComputeNumSim.scala:61)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Answer 0 (score: 0)
You cannot store an RDD in a Dataset or in any other distributed collection, and you cannot access a DataFrame (or the SparkSession) from inside map over a Dataset: unlike local collections, Datasets don't support nesting.
In this case, you should join the Datasets by key:

rows.alias("rows").join(
  spark.table("sample_data").alias("sample"),
  $"rows.pk" =!= $"sample.pk"
)

or, more explicitly, as a cross join with a filter:

rows.alias("rows")
  .crossJoin(spark.table("sample_data").alias("sample"))
  .where($"rows.pk" =!= $"sample.pk")
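From there, to reach the shape requested at the top of the question (each pk paired with the list of all other pks), one possible finishing step, sketched here with collect_list from org.apache.spark.sql.functions (the column alias "others" is just illustrative), would be roughly:

import org.apache.spark.sql.functions.collect_list

// Group the self-joined rows back by the left-hand pk and collect
// the remaining pks into an array column, mirroring what the
// original map/collect attempted to build.
rows.alias("rows")
  .crossJoin(spark.table("sample_data").alias("sample"))
  .where($"rows.pk" =!= $"sample.pk")
  .groupBy($"rows.pk")
  .agg(collect_list($"sample.pk").alias("others"))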