Spark Hive join query error

Date: 2017-02-21 07:03:47

Tags: java apache-spark hive apache-spark-sql cloudera

I have a Cloudera cluster with the following specification:

(screenshot of the cluster specification)

I created a simple Spark SQL application that joins two Hive tables. Both tables are external: the data for the healtpersonalcare_reviews table is stored as JSON files (1.7 GB), and the data for the health_ratings table is stored as CSV (115 MB). This is my code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hive warehouse location and Kryo buffer settings
val warehouseLocation = "/hive/warehouse"
var args_list = args.toList
var conf = new SparkConf()
  .set("spark.sql.warehouse.dir", warehouseLocation)
  .set("spark.kryoserializer.buffer.max", "1024m")

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

val table_view_name = args_list(0)
val limit = args_list(1)

// Register the HCatalog SerDe jar so the JSON-backed table can be read
val df_addjar = spark.sql("ADD JAR /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar")

var df_use = spark.sql("use testing")
// Left join the reviews table (JSON, 1.7 GB) with the ratings table (CSV, 115 MB)
var df = spark.sql("SELECT hp.asin, hp.helpful, hp.overall, hp.reviewerid, hp.reviewername, hp.reviewtext, hp.reviewtime, hp.summary, hp.unixreviewtime FROM testing.healtpersonalcare_reviews hp LEFT JOIN testing.health_ratings hr ON (hp.reviewerid = hr.reviewerid)")
var df_create_join_table = spark.sql("CREATE TABLE IF NOT EXISTS healtpersonalcare_joins (asin string,helpful array<int>,overall double,reviewerid string,reviewername string,reviewtext string,reviewtime string,summary string,unixreviewtime int)")

df.cache()
// collect() pulls the entire join result into the driver JVM
df.collect().foreach(println)

System.exit(0)
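
For context, the ADD JAR above pulls in the HCatalog JsonSerDe, so the JSON-backed external table was presumably defined along these lines. This is a sketch only; the real DDL is not shown in the question, and the LOCATION path below is hypothetical:

// Sketch only: columns mirror the SELECT and CREATE TABLE above,
// the LOCATION path is a placeholder, not the actual one.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS testing.healtpersonalcare_reviews (
    asin string, helpful array<int>, overall double, reviewerid string,
    reviewername string, reviewtext string, reviewtime string,
    summary string, unixreviewtime int)
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION '/data/healtpersonalcare_reviews'
""")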

I run the application with this command:

spark-submit --class org.sia.chapter03App.App --master yarn --deploy-mode client --executor-memory 1024m --driver-memory 1024m --conf spark.driver.maxResultSize=2g --verbose /root/sparktest/original-chapter03App-0.0.1-SNAPSHOT.jar name 10

I tried variations of the --executor-memory and --driver-memory values:

  1. With "--executor-memory 1024m --driver-memory 1024m" I get the error "java.lang.OutOfMemoryError: Java heap space".
  2. With "--executor-memory 2048m --driver-memory 2048m" I get "java.lang.OutOfMemoryError: GC overhead limit exceeded" in thread "main".

Has anyone encountered a problem like this? What is the solution? Thank you.
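
For reference, a minimal sketch of an alternative, under the assumption that the join result only needs to land in the healtpersonalcare_joins table: let the executors write the result directly to the table so the full 1.7 GB never has to fit in the driver, and only print a small sample:

// Sketch only (assumes populating healtpersonalcare_joins is the goal):
// executors write the join result straight into the Hive table,
// instead of collect()-ing everything back to the driver.
df.write
  .mode("append")
  .insertInto("testing.healtpersonalcare_joins")

df.show(20, truncate = false)   // only 20 rows are brought back to the driver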

0 Answers:

There are no answers yet.