Loading an RDD into Hive

Asked: 2017-01-09 11:47:04

Tags: apache-spark dataframe hive pyspark pyspark-sql

Using pyspark on Spark 1.6.x, I want to load an RDD (k = table_name, v = content) into a Hive table partitioned by (year, month, day).

The overall approach tries to follow the logic of this SQL:

ALTER TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% DROP IF EXISTS PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);
LOAD DATA INTO TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);
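For reference, the two statements above can be issued directly from pyspark 1.6 through `HiveContext.sql`. This is only a sketch: the table name, staging path, and partition values are hypothetical placeholders, and note that Hive's `LOAD DATA` syntax requires an `INPATH '<path>'` clause, which the query above omits.

```python
# Sketch: running the drop-partition / load-data pair from pyspark 1.6.
# Table name, path, and partition values below are hypothetical examples.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="load-into-hive")
hc = HiveContext(sc)

table, year, month, day = "my_table", 2017, 1, 9  # placeholder values

# Drop the partition if it already exists, then (re)load it.
hc.sql(
    "ALTER TABLE db_schema.{t} DROP IF EXISTS "
    "PARTITION (year={y}, month={m}, day={d})".format(t=table, y=year, m=month, d=day)
)
hc.sql(
    "LOAD DATA INPATH '/staging/{t}' INTO TABLE db_schema.{t} "
    "PARTITION (year={y}, month={m}, day={d})".format(t=table, y=year, m=month, d=day)
)
```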

Can anyone suggest an approach?

1 Answer:

Answer 0 (score: 1)

# Build a Hive-enabled session, turn the RDD into a DataFrame, and write it
# as a partitioned Hive table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
rdd = spark.sparkContext.parallelize([(1, 'cat', '2016-12-20'), (2, 'dog', '2016-12-21')])
df = spark.createDataFrame(rdd, schema=['id', 'val', 'dt'])
df.write.saveAsTable(name='default.test', format='orc', mode='overwrite', partitionBy='dt')

The key points are enableHiveSupport() when building the session and df.write.saveAsTable() for the partitioned write.
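Note that `SparkSession` is a Spark 2.x API, while the question targets Spark 1.6.x, where `HiveContext` plays the same role. A rough 1.6-style equivalent might look like the following sketch (the table name `default.test` and the sample rows are illustrative; exact Hive interoperability of `saveAsTable` output can depend on the build and format):

```python
# Spark 1.6.x sketch, assuming a Hive-enabled Spark build.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="rdd-to-hive")
hc = HiveContext(sc)

# Same sample data as the answer above.
rdd = sc.parallelize([(1, 'cat', '2016-12-20'), (2, 'dog', '2016-12-21')])
df = hc.createDataFrame(rdd, schema=['id', 'val', 'dt'])

# Write as an ORC table partitioned on 'dt'; for a pre-created Hive table,
# df.write.insertInto(...) is the usual alternative.
df.write.format('orc').mode('overwrite').partitionBy('dt').saveAsTable('default.test')
```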