Spark Hive join query error

Date: 2017-02-21 07:03:47

Tags: java apache-spark hive apache-spark-sql cloudera

I have a Cloudera cluster with the following specification:

(screenshot of the cluster specification)

I created a simple Spark SQL application that joins two Hive tables. Both tables are external: the data for the healtpersonalcare_reviews table is stored as JSON files (1.7 GB), and the data for the health_ratings table is stored as CSV (115 MB). This is my code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hive warehouse location and Kryo buffer settings
val warehouseLocation = "/hive/warehouse"
var args_list = args.toList
var conf = new SparkConf()
  .set("spark.sql.warehouse.dir", warehouseLocation)
  .set("spark.kryoserializer.buffer.max", "1024m")

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

val table_view_name = args_list(0)
val limit = args_list(1)

// Register the HCatalog SerDe jar so the JSON-backed table can be read
val df_addjar = spark.sql("ADD JAR /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar")

var df_use = spark.sql("use testing")
// Left join the reviews table (JSON, 1.7 GB) with the ratings table (CSV, 115 MB)
var df = spark.sql("SELECT hp.asin, hp.helpful, hp.overall, hp.reviewerid, hp.reviewername, hp.reviewtext, hp.reviewtime, hp.summary, hp.unixreviewtime FROM testing.healtpersonalcare_reviews hp LEFT JOIN testing.health_ratings hr ON (hp.reviewerid = hr.reviewerid)")
var df_create_join_table = spark.sql("CREATE TABLE IF NOT EXISTS healtpersonalcare_joins (asin string,helpful array<int>,overall double,reviewerid string,reviewername string,reviewtext string,reviewtime string,summary string,unixreviewtime int)")

df.cache()
// collect() pulls the entire join result into the driver JVM
df.collect().foreach(println)

System.exit(0)
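
For context, the ADD JAR above pulls in the HCatalog JsonSerDe, so the JSON-backed external table was presumably defined along these lines. This is a sketch only; the real DDL is not shown in the question, and the LOCATION path below is hypothetical:

// Sketch only: columns mirror the SELECT and CREATE TABLE above,
// the LOCATION path is a placeholder, not the actual one.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS testing.healtpersonalcare_reviews (
    asin string, helpful array<int>, overall double, reviewerid string,
    reviewername string, reviewtext string, reviewtime string,
    summary string, unixreviewtime int)
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION '/data/healtpersonalcare_reviews'
""")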

I run the application with this command:

spark-submit --class org.sia.chapter03App.App --master yarn --deploy-mode client --executor-memory 1024m --driver-memory 1024m --conf spark.driver.maxResultSize=2g --verbose /root/sparktest/original-chapter03App-0.0.1-SNAPSHOT.jar name 10

I tried variations of the --executor-memory and --driver-memory values:

  1. With "--executor-memory 1024m --driver-memory 1024m" I get the error "java.lang.OutOfMemoryError: Java heap space".
  2. With "--executor-memory 2048m --driver-memory 2048m" I get "java.lang.OutOfMemoryError: GC overhead limit exceeded" in thread "main".

Has anyone encountered a problem like this? What is the solution? Thank you.
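
For reference, a minimal sketch of an alternative, under the assumption that the join result only needs to land in the healtpersonalcare_joins table: let the executors write the result directly to the table so the full 1.7 GB never has to fit in the driver, and only print a small sample:

// Sketch only (assumes populating healtpersonalcare_joins is the goal):
// executors write the join result straight into the Hive table,
// instead of collect()-ing everything back to the driver.
df.write
  .mode("append")
  .insertInto("testing.healtpersonalcare_joins")

df.show(20, truncate = false)   // only 20 rows are brought back to the driver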

0 Answers:

There are no answers yet.