I need to read a structured stream from Kafka and write it into an existing Hive table. From my analysis, one option seems to be to do a readStream from the Kafka source and then a writeStream to a file sink on an HDFS path.
My question is: is it possible to write directly into the Hive table? Or is there a workaround that can be followed for this use case?
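For reference, a rough sketch of the file-sink option I was considering (assumes a SparkSession named spark; the broker, topic, and HDFS paths are placeholders):
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxxxx912:9092")
  .option("subscribe", "testtest")
  .load()
// Write Parquet files into a directory that an external Hive table could point at
val fileSinkQuery = kafkaDF
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/user/hive/warehouse/somedb.db/sometable")      // placeholder HDFS path
  .option("checkpointLocation", "/tmp/kafka_to_hdfs_checkpoint")   // placeholder checkpoint path
  .start()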
EDIT1:
.foreachBatch — this seems to work, but with the problem described below.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SaveMode
val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
//subscribe to kafka topic
val csvDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxxxx912:9092")
  .option("subscribe", "testtest")
  .load()
val abcd = csvDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)","CAST(topic AS STRING)","CAST(offset AS STRING)","CAST(partition AS STRING)","CAST(timestamp AS STRING)").as[(String, String, String, String, String, String)]
val query = abcd.writeStream
  .foreachBatch((batchDs: Dataset[_], batchId: Long) => {
    batchDs.write.mode(SaveMode.Append).insertInto("default.6columns")
  })
  .option("quote", "\u0000")
  .start()
hive> select * from 6columns;
OK
0 A3,L1,G1,P1,O1,101,TXN1 testtest 122 0 2019-05-23 12:38:49.515
0 A3,L1,G1,P1,O1,102,TXN2 testtest 123 0 2019-05-23 12:38:49.524
0 A1,L1,G1,P1,O1,100,TXN3 testtest 124 0 2019-05-23 12:38:49.524
0 A2,L2,G1,P1,O2,100,TXN4 testtest 125 0 2019-05-23 12:38:49.524
0 A3,L1,G1,P1,O1,103,TXN5 testtest 126 0 2019-05-23 12:38:54.525
0 A3,L1,G1,P1,O1,104,TXN6 testtest 127 0 2019-05-23 12:38:55.525
0 A4,L1,G1,P1,O1,100,TXN7 testtest 128 0 2019-05-23 12:38:56.526
0 A1,L1,G1,P1,O1,500,TXNID8 testtest 129 0 2019-05-23 12:38:57.526
0 A6,L2,G2,P1,O1,500,TXNID9 testtest 130 0 2019-05-23 12:38:57.526
What I am looking for is to split the value attribute of the Kafka message so that the data matches the Hive table layout, i.e. it becomes a 12-column table (A3,L1,G1,P1,O1,101,TXN1 is split into 7 attributes). I need a transformation similar to the .option("quote", "\u0000") I used when writing the DataFrame, but it does not seem to have any effect.
Answer (score: 1)
Once you have set up and are consuming the stream from Kafka, you can use the foreachBatch function like this:
val yourStream = spark
  .readStream   // readStream (not read) so the result is a streaming DataFrame
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "your_topic")   // the Kafka source also needs a topic subscription
  .load()
val query = yourStream.writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) => {
  batchDs
    .write
    .mode(SaveMode.Append)
    .insertInto("your_db.your_table")
}).start()

query.awaitTermination()
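For restartability you will usually also want a checkpoint location so that the Kafka offsets already processed survive a restart; a hedged variation of the query above (the checkpoint path is a placeholder):
val query = yourStream.writeStream
  .option("checkpointLocation", "/tmp/kafka_hive_checkpoint")   // placeholder path; stores offsets and batch progress
  .foreachBatch((batchDs: Dataset[_], batchId: Long) => {
    batchDs.write.mode(SaveMode.Append).insertInto("your_db.your_table")
  })
  .start()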
To split the comma-separated string into separate columns, you can use the split function: it puts everything separated by "," into an array, and you can then pick individual items by index, e.g. "SPLIT(CAST(value AS STRING), ',')[0]" gives you the first element.
So replace
val abcd = csvDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)","CAST(topic AS STRING)","CAST(offset AS STRING)","CAST(partition AS STRING)","CAST(timestamp AS STRING)").as[(String, String, String, String, String, String)]
with
val abcd = csvDF.selectExpr("CAST(key AS STRING)",
    "SPLIT(CAST(value AS STRING), ',')[0]", "SPLIT(CAST(value AS STRING), ',')[1]",
    "SPLIT(CAST(value AS STRING), ',')[2]", "SPLIT(CAST(value AS STRING), ',')[3]",
    "SPLIT(CAST(value AS STRING), ',')[4]", "SPLIT(CAST(value AS STRING), ',')[5]",
    "SPLIT(CAST(value AS STRING), ',')[6]",
    "CAST(topic AS STRING)", "CAST(offset AS STRING)", "CAST(partition AS STRING)", "CAST(timestamp AS STRING)")
  .as[(String, String, String, String, String, String, String, String, String, String, String, String)]
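Also note that insertInto resolves columns by position rather than by name, so the 12 expressions above must be listed in the same order as the columns of the target Hive table. If you want readable column names on the streaming DataFrame as well, each expression can be aliased; a sketch with made-up alias names:
val named = csvDF.selectExpr(
  "CAST(key AS STRING) AS key",
  "SPLIT(CAST(value AS STRING), ',')[0] AS c1",   // c1..c7 are illustrative names only
  "SPLIT(CAST(value AS STRING), ',')[1] AS c2",
  "SPLIT(CAST(value AS STRING), ',')[2] AS c3",
  "SPLIT(CAST(value AS STRING), ',')[3] AS c4",
  "SPLIT(CAST(value AS STRING), ',')[4] AS c5",
  "SPLIT(CAST(value AS STRING), ',')[5] AS c6",
  "SPLIT(CAST(value AS STRING), ',')[6] AS c7",
  "CAST(topic AS STRING) AS topic",
  "CAST(offset AS STRING) AS offset",
  "CAST(partition AS STRING) AS partition",
  "CAST(timestamp AS STRING) AS ts")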