Question

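Is it possible to save a DataFrame directly to Hive?

I tried converting the DataFrame to an RDD, saving it as a text file, and then loading that into Hive. But I would like to know whether I can save the DataFrame to Hive directly.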

Answer 1

You can create an in-memory temporary table and store it in a Hive table using sqlContext.

Let's say your DataFrame is myDf. You can create a temporary table with:

myDf.createOrReplaceTempView("mytempTable")

Then you can use a simple Hive statement to create the table and dump the data from the temporary table.

sqlContext.sql("create table mytable as select * from mytempTable");
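
For PySpark users, the same two steps can be wrapped in a small helper. This is only a sketch: the function name save_via_temp_view and its parameters are made up for illustration, and it assumes spark is a SparkSession (or SQLContext) with Hive support enabled and df is a DataFrame from that session.

```python
def save_via_temp_view(spark, df, view_name, table_name):
    """Register df as a temp view, then copy it into a table with CTAS.

    Hypothetical helper: `spark` is assumed to be a Hive-enabled
    SparkSession and `df` a DataFrame created from that session.
    """
    # Step 1: expose the DataFrame to SQL under a temporary name.
    df.createOrReplaceTempView(view_name)
    # Step 2: create the permanent table from the temporary view.
    statement = "CREATE TABLE {} AS SELECT * FROM {}".format(table_name, view_name)
    spark.sql(statement)
    return statement
```

A call such as save_via_temp_view(spark, myDf, "mytempTable", "mytable") would issue the same statement as the snippet above.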

Answer 2

Use DataFrameWriter.saveAsTable (df.write.saveAsTable(...)). See the Spark SQL and DataFrame Guide.

Answer 3

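I don't see df.write.saveAsTable(...) marked as deprecated in the Spark 2.0 documentation. It has worked for us on Amazon EMR. We were able to read data from S3 into a DataFrame, process it, create a table from the result, and read it with MicroStrategy. Vinay's answer also works.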

Answer 4

You need to have or create a HiveContext:

import org.apache.spark.sql.hive.HiveContext;

HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());

Then save the DataFrame directly, or select the columns to store as a Hive table.

Here df is the DataFrame:

df.write().mode("overwrite").saveAsTable("schemaName.tableName");

or

df.select(df.col("col1"),df.col("col2"), df.col("col3")) .write().mode("overwrite").saveAsTable("schemaName.tableName");

or

df.write().mode(SaveMode.Overwrite).saveAsTable("dbName.tableName");

The SaveModes are Append / Ignore / Overwrite / ErrorIfExists.
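
In PySpark the same modes are passed as strings. The sketch below validates the mode before writing; save_as_table and SAVE_MODES are hypothetical names used only for illustration, and df is assumed to be a DataFrame from a Hive-enabled session.

```python
# Mode strings accepted by DataFrameWriter.mode() in PySpark;
# "error" and "errorifexists" mean the same thing.
SAVE_MODES = ("append", "overwrite", "ignore", "error", "errorifexists")


def save_as_table(df, table_name, mode="errorifexists"):
    """Write df to table_name with an explicit save mode (hypothetical helper)."""
    if mode not in SAVE_MODES:
        raise ValueError("unknown save mode: {!r}".format(mode))
    # df.write is a property in PySpark, so there are no parentheses here.
    df.write.mode(mode).saveAsTable(table_name)
```

For example, save_as_table(df, "schemaName.tableName", "overwrite") corresponds to the first Java snippet above.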

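I have added the definition of HiveContext from the Spark documentation here:

In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default Spark build.

On Spark version 1.6.2, using "dbName.tableName" gives this error:

org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables. If the table name has dots (.) in it, please quote the table name with backticks (`).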

Answer 5

Saving to Hive is just a matter of calling write on a DataFrame created from your (Hive-enabled) SQLContext:

df.write.saveAsTable(tableName)

See https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/DataFrameWriter.html#saveAsTable(java.lang.String)

From Spark 2.2: use Dataset instead of DataFrame.

Answer 6

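Sorry for the late reply, but I don't see an accepted answer.

df.write().saveAsTable will throw an AnalysisException and is not compatible with Hive tables.

Storing the DataFrame with df.write().format("hive") should do the trick!

However, if that doesn't work, then going by the previous comments and answers, this is what I consider the best solution (I'm open to suggestions, though).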

The best approach is to explicitly create the Hive table (including a PARTITIONED table),

def createHiveTable: Unit = {
  // The s prefix makes Scala substitute the $placeholders.
  spark.sql(s"CREATE TABLE $hive_table_name($fields) " +
    s"PARTITIONED BY ($partition_column String) STORED AS $StorageType")
}

save the DataFrame as a temporary table,

df.createOrReplaceTempView(tempTableName)

and insert into the PARTITIONED Hive table:

spark.sql(s"insert into table default.$hive_table_name PARTITION($partition_column) select * from $tempTableName")
spark.sql(s"select * from default.$hive_table_name").show(1000,false)

The LAST COLUMN of the DataFrame will be the PARTITION COLUMN, so create the Hive table accordingly!
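
Because the insert matches columns by position, it can help to reorder the DataFrame so the partition column really is last before creating the temp view. A minimal PySpark-side sketch follows; the helper name partition_column_last and the column names in the usage note are hypothetical.

```python
def partition_column_last(columns, partition_column):
    """Return the column names with partition_column moved to the end."""
    if partition_column not in columns:
        raise ValueError("no such column: {!r}".format(partition_column))
    # Keep the relative order of all other columns unchanged.
    others = [c for c in columns if c != partition_column]
    return others + [partition_column]
```

With a DataFrame df that has a column named dt, df.select(*partition_column_last(df.columns, "dt")) would give a DataFrame whose last column is dt.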

Please comment on whether it works or not!

-Update-

df.write()
  .partitionBy("$partition_column")
  .format("hive")
  .mode(SaveMode.Append)
  .saveAsTable($new_table_name_to_be_created_in_hive)  //Table should not exist OR should be a PARTITIONED table in HIVE

Answer 7

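This is a PySpark version that creates a Hive table from Parquet files. You may have generated Parquet files using an inferred schema and now want to push the definition to the Hive metastore. You can also push the definition to systems such as AWS Glue or AWS Athena, not just the Hive metastore. Here I use spark.sql to push/create the permanent table.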

    # Assumes `spark` is an existing SparkSession with Hive support enabled.
    # Location where my parquet files are present.
    df = spark.read.parquet("s3://my-location/data/")
    buf = []
    buf.append('CREATE EXTERNAL TABLE test123 (')
    keyanddatatypes = df.dtypes
    sizeof = len(df.dtypes)
    print("size----------", sizeof)
    count = 1
    for eachvalue in keyanddatatypes:
        print(count, sizeof, eachvalue)
        # Every column except the last one is followed by a comma.
        if count == sizeof:
            total = str(eachvalue[0]) + ' ' + str(eachvalue[1])
        else:
            total = str(eachvalue[0]) + ' ' + str(eachvalue[1]) + ','
        buf.append(total)
        count = count + 1

    buf.append(' )')
    buf.append(' STORED as parquet ')
    buf.append("LOCATION")
    buf.append("'")
    buf.append('s3://my-location/data/')
    buf.append("'")
    ##partition by pt
    tabledef = ''.join(buf)

    print("---------print definition ---------")
    print(tabledef)
    ## create a table using spark.sql. Assuming you are using spark 2.1+
    spark.sql(tabledef)
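
The statement built above can also be produced by a small reusable function. This is a sketch only: build_external_table_ddl is a hypothetical name, and it takes a list of (column name, type) pairs, which is the shape df.dtypes returns in PySpark.

```python
def build_external_table_ddl(table_name, dtypes, location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet data.

    dtypes is a list of (column_name, type_string) pairs, for example
    the value of df.dtypes on a PySpark DataFrame.
    """
    # One "name type" entry per column, separated by commas.
    columns = ", ".join("{} {}".format(name, dtype) for name, dtype in dtypes)
    return (
        "CREATE EXTERNAL TABLE {} ({}) STORED AS PARQUET LOCATION '{}'"
        .format(table_name, columns, location)
    )
```

The result can be passed to spark.sql, for example spark.sql(build_external_table_ddl("test123", df.dtypes, "s3://my-location/data/")).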

Answer 8

In my case this works fine:

from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("DatabaseName")
df = spark.read.format("csv").option("Header",True).load("/user/csvlocation.csv")
df.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table",<tablename>).save()

Done!

To read the data back, assuming you named the table "Employee":

hive.executeQuery("select * from Employee").show()

For more details, see: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive-read-write-operations.html

Answer 9

For Hive external tables I use this function in PySpark:

RawBlock

Answer 10

You can use the Hortonworks spark-llap library like this

Answer 11

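If you want to create a Hive table (one that does not exist yet) from a DataFrame (sometimes DataFrameWriter.saveAsTable fails to create it), StructType.toDDL helps by listing the columns as a string.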

val df = ...

val schemaStr = df.schema.toDDL // This gives the columns
spark.sql(s"""create table hive_table ( ${schemaStr})""")

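// Now write the dataframe to the table.
// Note: with the default save mode, saveAsTable fails if the table already exists,
// so an explicit mode (for example append) may be needed here.
df.write.saveAsTable("hive_table")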

hive_table will be created in the default database, since we did not provide any database in spark.sql(). stg.hive_table can be used to create hive_table in the stg database.
