pyspark: writing a DataFrame into a partitioned Hive table

Date: 2018-04-30 12:47:02

Tags: apache-spark hive partition

I have a table in Hive, created as follows:

hive> create table if not exists stock_quote (
          TradeDay string, TradeTime string, OpenPrice string, HighPrice string,
          LowPrice string, ClosePrice string, volume string)
      partitioned by (tickerid string)
      row format delimited fields terminated by ','
      stored as textfile;
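(The snippets below use a `spark` session that is never shown; for writes to Hive tables it would need Hive support enabled. A minimal sketch of the assumed setup, with a made-up application name:)

from pyspark.sql import SparkSession

# Assumed setup (not shown in the question): a SparkSession with Hive
# support enabled so that saveAsTable/insertInto can reach the metastore.
spark = SparkSession.builder \
    .appName('stock_quote_load') \
    .enableHiveSupport() \
    .getOrCreate()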

I tried to insert data into the table with the following code:

from pyspark.sql import Row

sc = spark.sparkContext
lines = sc.textFile('file:///<File Name>')
rows = lines.map(lambda line: line.split(','))
# Build a Row per CSV line, naming each field
rows_map = rows.map(lambda row: Row(tickerid=row[0], tradeday=row[1], tradetime=row[2],
                                    openprice=row[3], highprice=row[4],
                                    lowprice=row[5], closeprice=row[6],
                                    volume=row[7]))
rows_df = spark.createDataFrame(rows_map)
rows_df.write.format('hive').mode('append').partitionBy('tickerid').saveAsTable('stock_quote')

and got the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o72.saveAsTable.
: org.apache.spark.SparkException: Requested partitioning does not match the stock_quote table:
Requested partitions: 
Table partitions: tickerid

Then I tried this instead:

from collections import namedtuple

stock_quote_table = namedtuple("stock_quote",
                               ["tickerid", "tradeday", "tradetime", "openprice", "highprice", "lowprice", "closeprice", "volume"])
rows_map = rows.map(lambda row: stock_quote_table(row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7]))
rows_df = spark.createDataFrame(rows_map)
rows_df.write.mode('append').partitionBy('tickerid').insertInto('default.stock_quote')

which produced this error:

pyspark.sql.utils.AnalysisException: "insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().;"
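(As an aside: insertInto resolves columns by position rather than by name, so before dropping partitionBy() the DataFrame would normally be reordered to match the table definition, with the partition column last. A sketch only, reusing the column names from the snippet above:)

# Sketch: put the data columns in the table's declared order and the
# partition column (tickerid) last, since insertInto matches by position.
# (Depending on the Hive setup, dynamic partition inserts may also need
#  hive.exec.dynamic.partition.mode set to nonstrict.)
ordered_df = rows_df.select('tradeday', 'tradetime', 'openprice', 'highprice',
                            'lowprice', 'closeprice', 'volume', 'tickerid')
ordered_df.write.mode('append').insertInto('default.stock_quote')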

So I dropped partitionBy() and changed the last line of that snippet to:

rows_df.write.mode('append').insertInto('default.stock_quote')

This does insert the data into the table, but under HDFS it created a subdirectory for every row of the file, named like /user/hive/warehouse/stock_quote/tickerid=980, and inside each of those the file names start with 'part...'.
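(Directories named tickerid=<value> containing part-... files are the normal on-disk layout of a Hive-partitioned table, one directory per partition value. One way to check which partitions were actually registered is through Spark SQL, sketched here:)

# Sketch: list the partitions Hive has recorded for the table.
spark.sql('SHOW PARTITIONS default.stock_quote').show(truncate=False)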

Please explain what is going wrong in the code.

0 Answers:

No answers yet.