I'm trying to read data from Redis with Spark 2.3 (Java) code. I can read non-streaming data from Redis, but reading from a Redis stream fails with the errors below. 1) When I specify the format as:
Dataset<Row> RedisData = spark.readStream()
        .format("org.apache.spark.sql.redis")
        .option("stream.keys", "carsstream")
        .schema(UserSchema5)
        .load();

the error is:

java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.redis does not support streamed reading

2) When I specify the format as:

Dataset<Row> RedisData = spark.readStream()
        .format("redis")
        .option("stream.keys", "carsstream")
        .schema(UserSchema5)
        .load();

the error is:

java.lang.ClassNotFoundException: Failed to find data source: redis. Please find packages at http://spark.apache.org/third-party-projects.html

I have added the jars for Jedis (version 3.1.0) and spark-redis (2.3.1).
Any suggestions would be helpful.
Answer 0 (score: 0)
Based on the error you are seeing and the code you are trying to run, I infer that you are using Spark Structured Streaming. Please refer to the excerpts below. I have also shared a link to my GitHub repository, where you can find the complete code.

You need to create a regular DataFrame/Dataset, not a Streaming DataFrame/Dataset. So you have to do the following:
val keysPattern = s"${topic}:*"

// Schema for the state data cached in Redis
val redisSchema = StructType(
  List(
    StructField("col1", StringType, true),
    StructField("col2", StringType, true),
    StructField("col3", StringType, true)
  )
)

// Static (batch) read of the Redis keys matching the pattern
val redisDf = spark.read
  .format("org.apache.spark.sql.redis")
  .schema(redisSchema)
  .option("keys.pattern", keysPattern)
  .load
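The streamingDf used in the join below is not defined in this answer; given the Kafka dependency in the POM, it presumably comes from a Kafka source. A minimal sketch, assuming a hypothetical broker address, and leaving the parsing of the payload into typed columns (such as u_id) to you:

```scala
// Hypothetical streaming source; the broker address and topic subscription
// are placeholders, not part of the original answer.
val rawStreamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", topic)
  .load()

// How you turn the raw Kafka value into typed columns (u_id, ...) depends on
// your payload format, e.g. from_json with a known schema.
val streamingDf = rawStreamDf.selectExpr("CAST(value AS STRING) AS value")
```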
You then join this regular dataframe with the Streaming DataFrame like this:
// Join the streaming data with the regular DataFrame
val joinedDf = streamingDf.joinWith(
    redisDf,
    trim(col("col1")) === trim(col("u_id")),
    "left"
  ).select("_1.*", "_2.*")
Joining a regular DataFrame with a Streaming DataFrame produces a Streaming DataFrame. To write the Streaming DataFrame back to Redis, you also need to implement a foreach writer. It looks like this:
// Redis connector - ForeachWriter sink
val redisForeachWriter: RedisForeachWriter = new RedisForeachWriter("localhost", "6379", topic)

// Push new user details to Redis for state reference
val redisSinkQuery = joinedDf
  .select("col1", "col2", ... , "coln")
  .writeStream
  .outputMode("update")
  .foreach(redisForeachWriter)
  .start
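Note that start returns immediately and the query runs asynchronously; a Structured Streaming driver normally blocks on the returned query handle so the application does not exit (standard Spark API, though not shown in the original answer):

```scala
// Keep the driver alive until the streaming query stops or fails.
redisSinkQuery.awaitTermination()
```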
A sample RedisForeachWriter looks like this:
import org.apache.spark.sql.ForeachWriter
import org.apache.spark.sql.Row
import redis.clients.jedis.Jedis

class RedisForeachWriter(val host: String, port: String, val topic: String) extends ForeachWriter[Row] {

  // Lazily created Jedis connection, one per partition
  var jedis: Jedis = _

  def connect() = {
    jedis = new Jedis(host, port.toInt)
  }

  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  override def process(record: Row) = {
    // Assumes the join key u_id is the second column of the row
    val u_id = record.getString(1)
    if (!(u_id == null || u_id.isEmpty())) {
      val columns: Array[String] = record.schema.fieldNames
      if (jedis == null) {
        connect()
      }
      // Write every non-empty column of the row into the hash "<topic>:<u_id>"
      for (i <- 0 until columns.length) {
        if (!(record.getString(i) == null || record.getString(i).isEmpty()))
          jedis.hset(s"${topic}:" + u_id, columns(i), record.getString(i))
      }
    }
  }

  override def close(errorOrNull: Throwable) = {
  }
}
You can refer to my GitHub for a similar use case and further clarification: https://github.com/krohit-scala/MSStreamingStack
Edit: Please add these dependencies in your application POM:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>com.redislabs</groupId>
    <artifactId>spark-redis</artifactId>
    <version>2.3.1</version>
</dependency>