Question

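I am using the Hortonworks sandbox with Spark 1.6 in Azure. I have a Hive database populated with TPC-DS sample data. I want to read some SQL queries from an external file and run them against the Hive dataset in Spark. I followed the topic "Using hive database in spark", but it uses only one table from my dataset and it again writes the SQL query inside the Spark code. I need to define the whole database as the source for my queries. I think I should use DataFrames, but I am not sure and do not know how. I also want to import the SQL queries from an external .sql file rather than writing them out again in code. Could you guide me on how to do this? Thank you very much, best regards!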

Answer 1

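Spark can read data directly from Hive tables. You can create and drop Hive tables with Spark, and even run all Hive HQL operations through Spark. To do this, you need to use the Spark HiveContext.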

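From the Spark documentation:

Spark HiveContext provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup.

For details, see the Spark documentation.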

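To avoid writing SQL in your code, you can put all of the Hive queries in a properties file and then refer to each query by its key in the code.

See below for an implementation that uses the Spark HiveContext together with a properties file in Spark Scala.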

package com.spark.hive.poc

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
// Row and the Spark SQL data types used to build the schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadPropertyFiles extends Serializable {

  val conf = new SparkConf().setAppName("read local file")

  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")

  val sc = new SparkContext(conf)
  val sqlContext = new HiveContext(sc)

  def main(args: Array[String]): Unit = {

    // Load the queries from the properties file on HDFS (path given as args(0))
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fileSystem = FileSystem.get(hadoopConf)
    val propertiesPath = new Path(args(0))
    val inputStream = fileSystem.open(propertiesPath)
    val properties = new java.util.Properties
    try properties.load(inputStream) finally inputStream.close()

    // Create an RDD from the comma-separated input files
    val people = sc.textFile("/user/User1/spark_hive_poc/input/")
    // The schema is encoded in a string
    val schemaString = "name address"

    // Generate the schema based on the schema string
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Convert records of the RDD (people) to Rows
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
    // Apply the schema to the RDD
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()

    // Register the DataFrame so the queries can refer to it as tbl_temp
    peopleDataFrame.registerTempTable("tbl_temp")

    // Query the temporary table (lazy: nothing runs until an action is called)
    val data = sqlContext.sql(properties.getProperty("temp_table"))

    // Drop Hive table
    sqlContext.sql(properties.getProperty("drop_hive_table"))
    // Create Hive table (the key spelling matches the properties file)
    sqlContext.sql(properties.getProperty("create_hive_tavle"))
    // Insert data into Hive table
    sqlContext.sql(properties.getProperty("insert_into_hive_table"))
    // Select data from Hive table
    sqlContext.sql(properties.getProperty("select_from_hive")).show()

    sc.stop()

  }
}

Entries in the properties file:

temp_table=select * from tbl_temp
drop_hive_table=DROP TABLE IF EXISTS default.test_hive_tbl
create_hive_tavle=CREATE TABLE IF NOT EXISTS default.test_hive_tbl(name string, city string) STORED AS ORC
insert_into_hive_table=insert overwrite table default.test_hive_tbl select * from tbl_temp
select_from_hive=select * from default.test_hive_tbl
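
One thing to watch with this approach: `java.util.Properties.getProperty` returns null when a key is missing, so a misspelled key only shows up later as an obscure failure when the query is run. A minimal sketch of a stricter lookup follows. The object name `QueryFile` and its two methods are hypothetical helpers, not part of the answer's code or of Spark, and the text is parsed from a string here only to keep the example self-contained.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helper (not from the original answer): keyed query lookup
// that fails loudly instead of returning null.
object QueryFile {
  // Parse properties-format text (one key=value entry per line).
  def parse(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  // Return the query stored under `key`, or throw with a clear message
  // if the key is absent from the file.
  def query(props: Properties, key: String): String =
    Option(props.getProperty(key)).getOrElse(
      throw new NoSuchElementException(s"No query defined for key '$key'"))
}
```

In the job above, the same idea would apply to the `Properties` object loaded from HDFS, for example `sqlContext.sql(QueryFile.query(properties, "select_from_hive"))`.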

The spark-submit command to run this job:

[User1@hadoopdev ~]$ spark-submit --num-executors 1 \
--executor-memory 100M --total-executor-cores 2 --master local \
--class com.spark.hive.poc.ReadPropertyFiles Hive-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
/user/User1/spark_hive_poc/properties/sql.properties

Note: the properties file location should be an HDFS location.
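
If the queries live in a plain .sql file rather than a properties file, as the question asks, one possible approach is to split the script into statements and run them one at a time after selecting the database. The sketch below is only an illustration under stated assumptions: `SqlScript`, `statements`, and `runAll` are hypothetical names, the database name `tpcds` in the usage note is a placeholder, and the split on `;` is naive (it assumes no semicolons inside string literals or comments). The `execute` parameter stands in for whatever runs a single statement, so the block itself has no Spark dependency.

```scala
// Hypothetical helper (not from the original answer): run a .sql script
// statement by statement against one database.
object SqlScript {
  // Split a script on ';', trim each piece, and drop empty pieces.
  // Naive: assumes no ';' inside string literals or comments.
  def statements(script: String): Seq[String] =
    script.split(";").map(_.trim).filter(_.nonEmpty).toSeq

  // Issue "USE <database>" first so unqualified table names in the script
  // resolve against that database, then run each statement in order.
  def runAll(database: String, script: String)(execute: String => Unit): Unit =
    (s"USE $database" +: statements(script)).foreach(execute)
}
```

With the HiveContext from the answer, this might be called as `SqlScript.runAll("tpcds", scriptText)(q => sqlContext.sql(q).show())`, where `scriptText` is the file content read as a string, for example through `fileSystem.open` as in the job above for a file on HDFS. Because every table in the selected Hive database is visible to `sqlContext.sql`, no per-table DataFrame registration is needed for tables that already exist in Hive.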
