I have a Cloudera cluster with the following specification:
I created a simple Spark SQL application to join two Hive tables. Both tables are external: the data for the healtpersonalcare_reviews table is stored as JSON files (1.7 GB), and the data for the health_ratings table is stored as CSV (115 MB). Here is my code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val warehouseLocation = "/hive/warehouse"
val args_list = args.toList

val conf = new SparkConf()
  .set("spark.sql.warehouse.dir", warehouseLocation)
  .set("spark.kryoserializer.buffer.max", "1024m")

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

// Application arguments (e.g. "name 10" from the spark-submit command below).
val table_view_name = args_list(0)
val limit = args_list(1)

// The JsonSerDe used by the JSON-backed table ships in hive-hcatalog-core.jar.
val df_addjar = spark.sql("ADD JAR /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar")
val df_use = spark.sql("use testing")

val df = spark.sql("SELECT hp.asin, hp.helpful, hp.overall, hp.reviewerid, hp.reviewername, hp.reviewtext, hp.reviewtime, hp.summary, hp.unixreviewtime FROM testing.healtpersonalcare_reviews hp LEFT JOIN testing.health_ratings hr ON (hp.reviewerid = hr.reviewerid)")

val df_create_join_table = spark.sql("CREATE TABLE IF NOT EXISTS healtpersonalcare_joins (asin string, helpful array<int>, overall double, reviewerid string, reviewername string, reviewtext string, reviewtime string, summary string, unixreviewtime int)")

df.cache()
df.collect().foreach(println)
System.exit(0)
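For reference, the external tables were created roughly like this (a sketch, not the exact DDL; the column list and LOCATION path are illustrative). The reviews table uses the JsonSerDe, which is why hive-hcatalog-core.jar is added above:

// Illustrative DDL sketch for the JSON-backed external table.
// org.apache.hive.hcatalog.data.JsonSerDe is provided by hive-hcatalog-core.jar;
// the LOCATION path here is hypothetical.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS testing.healtpersonalcare_reviews (
    asin string, helpful array<int>, overall double, reviewerid string,
    reviewername string, reviewtext string, reviewtime string,
    summary string, unixreviewtime int)
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  LOCATION '/data/healtpersonalcare_reviews'
""")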
I run the application with this command:
spark-submit --class org.sia.chapter03App.App --master yarn --deploy-mode client --executor-memory 1024m --driver-memory 1024m --conf spark.driver.maxResultSize=2g --verbose /root/sparktest/original-chapter03App-0.0.1-SNAPSHOT.jar name 10
I have tried various values for --executor-memory and --driver-memory.
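I also suspect collect() itself: it pulls the entire join result (on the order of the 1.7 GB input) into a 1024m driver. A minimal sketch of what I could do instead, keeping the write on the executors (this assumes the column order of the SELECT matches healtpersonalcare_joins, and is untested):

// Sketch: write the join result directly into the Hive table from the executors,
// so the full result never has to fit in driver memory.
df.write.mode("append").insertInto("testing.healtpersonalcare_joins")

// If I only need to eyeball the data, fetch a bounded number of rows instead:
df.show(20, truncate = false)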
Has anyone run into a problem like this? What is the solution? Thanks.