I have a table in Hive, created as follows:
hive> create table if not exists stock_quote (
          TradeDay string, TradeTime string, OpenPrice string, HighPrice string,
          LowPrice string, ClosePrice string, volume string)
      partitioned by (tickerid string)
      row format delimited fields terminated by ','
      stored as textfile;
I tried to insert into the table with the following code:
from pyspark.sql import Row

sc = spark.sparkContext
# read the raw CSV file and split each line on commas
lines = sc.textFile('file:///<File Name>')
rows = lines.map(lambda line: line.split(','))
# wrap each record in a Row with named fields
rows_map = rows.map(lambda row: Row(tickerid=row[0], tradeday=row[1], tradetime=row[2],
                                    openprice=row[3], highprice=row[4],
                                    lowprice=row[5], closeprice=row[6],
                                    volume=row[7]))
rows_df = spark.createDataFrame(rows_map)
rows_df.write.format('hive').mode('append').partitionBy('tickerid').saveAsTable('stock_quote')
This failed with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.saveAsTable.
: org.apache.spark.SparkException: Requested partitioning does not match the stock_quote table:
Requested partitions:
Table partitions: tickerid
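For context, the table's partition metadata can be checked from Spark itself; a minimal sketch (assuming the same Hive-enabled SparkSession named spark as above):

# The "# Partition Information" section of the output should list
# tickerid as the only partition column of the table.
spark.sql("DESCRIBE FORMATTED default.stock_quote").show(50, truncate=False)

# Writing dynamic partition values into a Hive table also usually needs
# these Hive settings (assumption: they are not already set in hive-site.xml):
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")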
Next I tried this:
from collections import namedtuple

stock_quote_table = namedtuple("stock_quote",
    ["tickerid", "tradeday", "tradetime", "openprice", "highprice",
     "lowprice", "closeprice", "volume"])
rows_map = rows.map(lambda row: stock_quote_table(row[0], row[1], row[2], row[3],
                                                  row[4], row[5], row[6], row[7]))
rows_df = spark.createDataFrame(rows_map)
rows_df.write.mode('append').partitionBy('tickerid').insertInto('default.stock_quote')
which raised this error:
pyspark.sql.utils.AnalysisException: "insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().;"
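(For reference, the DataFrameWriter.insertInto documentation says it ignores column names and uses position-based resolution, so the DataFrame's columns have to line up with the table's declared order, partition column last. A minimal sketch of that reordering, using the column names from the DDL above:)

# insertInto() matches columns by POSITION, not by name, so select the
# data columns in the table's declared order and put the partition
# column (tickerid) last.
ordered_df = rows_df.select('tradeday', 'tradetime', 'openprice', 'highprice',
                            'lowprice', 'closeprice', 'volume', 'tickerid')
ordered_df.write.mode('append').insertInto('default.stock_quote')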
So I changed the last line to:
rows_df.write.mode('append').insertInto('default.stock_quote')
This version did insert data into the table, but under HDFS it created a separate subdirectory for every row of the file, named like /user/hive/warehouse/stock_quote/tickerid=980, and under each of those a file whose name starts with 'part-...'.
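One way to see what insertInto actually received is to inspect the DataFrame's column order (the namedtuple keeps its declared field order, with tickerid first; and on Spark 2.x the earlier Row(**kwargs) version sorts its fields alphabetically, so that order can differ from the order written in the Row(...) call). A minimal check, assuming the rows_df built above:

# Show the column order that position-based insertInto will consume.
rows_df.printSchema()
print(rows_df.columns)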
Can someone explain what is going wrong in this code?