I fetch data from Kafka, process it in Spark Streaming, and write the results to Cassandra.
I am trying to filter the DStream records, but the filter has no effect and the complete, unfiltered set of records is written to Cassandra.
Any suggestions or sample code for filtering records on multiple columns would be highly appreciated; I have researched this but could not find a solution.
class SparkKafkaConsumer1(val recordStream: org.apache.spark.streaming.dstream.DStream[String], val streaming: StreamingContext) {

  // Column 10 of each pipe-delimited line holds the address country
  val internationalAddress = recordStream.map(line => line.split("\\|")(10).toUpperCase)

  def timeToStr(epochMillis: Long): String =
    DateTimeFormat.forPattern("YYYYMMddHHmmss").print(epochMillis)

  // Note: this compares the DStream object itself to "INDIA" rather than filtering each record
  if (internationalAddress == "INDIA") {
    print("-----------------------------------------------")
    recordStream.print()

    val riskScore = "1"
    val timestamp: Long = System.currentTimeMillis
    val formatedTimeStamp = timeToStr(timestamp)

    var wc1 = recordStream.map(_.split("\\|")).map(r => Row(r(0), r(1), r(2), r(3), r(4).toInt, r(5).toInt, r(6).toInt, r(7), r(8), r(9), r(10), r(11), r(12), r(13), r(14), r(15), r(16), riskScore.toInt, 0, 0, 0, formatedTimeStamp))

    implicit val rowWriter = SqlRowWriter.Factory
    wc1.saveToCassandra("fraud", "fraudrating", SomeColumns("purchasetimestamp", "sessionid", "productdetails", "emailid", "productprice", "itemcount", "totalprice", "itemtype", "luxaryitem", "shippingaddress", "country", "bank", "typeofcard", "creditordebitcardnumber", "contactdetails", "multipleitem", "ipaddress", "consumer1score", "consumer2score", "consumer3score", "consumer4score", "recordedtimestamp"))
  }
}
(Note: I do have records with internationalAddress = INDIA in Kafka, and I am very new to Scala.)
Answer 0 (score: 1)
I'm not sure exactly what you are trying to do, but if you simply want to filter the records relating to India, you can do this:
implicit val rowWriter = SqlRowWriter.Factory

recordStream
  .filter(_.split("\\|")(10).toUpperCase == "INDIA")  // keep only records whose country column is INDIA
  .map(_.split("\\|"))
  .map(r => Row(...))
  .saveToCassandra(...)
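One way to avoid splitting each line twice (once in the filter and once in the map) is to split first and then filter on the resulting array, e.g.:

recordStream
  .map(_.split("\\|"))                   // split each line once
  .filter(_(10).toUpperCase == "INDIA")  // then filter on the country field
  .map(r => Row(...))                    // same Row construction as in the question
  .saveToCassandra(...)                  // same keyspace/table/columns as in the question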
As a side note, I think case classes would be useful for you here.
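For example, here is a minimal sketch of the case-class approach. The FraudRecord name and its three fields are illustrative, mirroring a few of the columns above; with lowercase field names, the connector's default mapping lines up directly with the Cassandra column names:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams

// Illustrative case class covering three of the table's columns
case class FraudRecord(sessionid: String, country: String, consumer1score: Int)

recordStream
  .map(_.split("\\|"))                    // parse each line once
  .map(f => FraudRecord(f(1), f(10), 1))  // pick out only the fields you need
  .filter(_.country.toUpperCase == "INDIA")
  .saveToCassandra("fraud", "fraudrating", SomeColumns("sessionid", "country", "consumer1score"))

With a case class you get named, typed fields instead of positional array indexing, and the connector can map fields to columns without a custom row writer.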