pyspark: writing a DataFrame into a partitioned Hive table

Date: 2018-04-30 12:47:02

Tags: apache-spark hive partition

I have a table in Hive, created as follows:

hive> create table if not exists stock_quote (
          TradeDay string, TradeTime string, OpenPrice string, HighPrice string,
          LowPrice string, ClosePrice string, volume string)
      partitioned by (tickerid string)
      row format delimited fields terminated by ','
      stored as textfile;
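(The snippets below use a `spark` session that is never shown; for writes to Hive tables it would need Hive support enabled. A minimal sketch of the assumed setup, with a made-up application name:)

from pyspark.sql import SparkSession

# Assumed setup (not shown in the question): a SparkSession with Hive
# support enabled so that saveAsTable/insertInto can reach the metastore.
spark = SparkSession.builder \
    .appName('stock_quote_load') \
    .enableHiveSupport() \
    .getOrCreate()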

I tried to insert data into the table with the following code:

from pyspark.sql import Row

sc = spark.sparkContext
lines = sc.textFile('file:///<File Name>')
rows = lines.map(lambda line: line.split(','))
# Build a Row per CSV line, naming each field
rows_map = rows.map(lambda row: Row(tickerid=row[0], tradeday=row[1], tradetime=row[2],
                                    openprice=row[3], highprice=row[4],
                                    lowprice=row[5], closeprice=row[6],
                                    volume=row[7]))
rows_df = spark.createDataFrame(rows_map)
rows_df.write.format('hive').mode('append').partitionBy('tickerid').saveAsTable('stock_quote')

and got the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o72.saveAsTable.
: org.apache.spark.SparkException: Requested partitioning does not match the stock_quote table:
Requested partitions: 
Table partitions: tickerid

Then I tried this instead:

from collections import namedtuple

stock_quote_table = namedtuple("stock_quote",
                               ["tickerid", "tradeday", "tradetime", "openprice", "highprice", "lowprice", "closeprice", "volume"])
rows_map = rows.map(lambda row: stock_quote_table(row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7]))
rows_df = spark.createDataFrame(rows_map)
rows_df.write.mode('append').partitionBy('tickerid').insertInto('default.stock_quote')

which produced this error:

pyspark.sql.utils.AnalysisException: "insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().;"
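(As an aside: insertInto resolves columns by position rather than by name, so before dropping partitionBy() the DataFrame would normally be reordered to match the table definition, with the partition column last. A sketch only, reusing the column names from the snippet above:)

# Sketch: put the data columns in the table's declared order and the
# partition column (tickerid) last, since insertInto matches by position.
# (Depending on the Hive setup, dynamic partition inserts may also need
#  hive.exec.dynamic.partition.mode set to nonstrict.)
ordered_df = rows_df.select('tradeday', 'tradetime', 'openprice', 'highprice',
                            'lowprice', 'closeprice', 'volume', 'tickerid')
ordered_df.write.mode('append').insertInto('default.stock_quote')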

So I dropped partitionBy() and changed the last line of that snippet to:

rows_df.write.mode('append').insertInto('default.stock_quote')

This does insert the data into the table, but under HDFS it created a subdirectory for every row of the file, named like /user/hive/warehouse/stock_quote/tickerid=980, and inside each of those the file names start with 'part...'.
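(Directories named tickerid=<value> containing part-... files are the normal on-disk layout of a Hive-partitioned table, one directory per partition value. One way to check which partitions were actually registered is through Spark SQL, sketched here:)

# Sketch: list the partitions Hive has recorded for the table.
spark.sql('SHOW PARTITIONS default.stock_quote').show(truncate=False)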

Please explain what is going wrong in the code.

0 Answers:

No answers yet.