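I receive live-updating data from PostgreSQL and want to feed that live data into a fixed model to make predictions about customers, using Spark Streaming or Kafka.
Please recommend any blog posts that work well, exact code, or any information or advice you have. Getting live-updating data from PostgreSQL/MySQL into a Python/Java environment would also be fine! Thanks!
Or is this perhaps not achievable?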
Answer 0 (score: 0)
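Here is a helper I use; I hope it is useful to you.
My Spark version is 2.2.0 and the programming language is Python.
The data flows from Kafka to MySQL, and the Kafka version is 0.9.
Note: you must use the correct jars for MySQL and Kafka; you can find them on the official websites.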
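One way to supply those jars is to let `spark-submit` resolve them through `--packages`. The helper below is only a sketch that assembles such a command line. The Kafka coordinate follows the naming of the 0.8 integration that `pyspark.streaming.kafka` relies on; the MariaDB connector version and the script name `streaming_job.py` are assumptions, so adjust them to your environment.

```python
def build_submit_command(script,
                         spark_version="2.2.0",
                         scala_version="2.11",
                         jdbc_package="org.mariadb.jdbc:mariadb-java-client:2.2.0"):
    """Return the spark-submit argument list for a Kafka-to-JDBC streaming job.

    jdbc_package is an assumed coordinate; use the driver release that
    matches your database server.
    """
    # Kafka integration artifact: its Scala and Spark versions must match the cluster
    kafka_package = "org.apache.spark:spark-streaming-kafka-0-8_%s:%s" % (
        scala_version, spark_version)
    packages = ",".join([kafka_package, jdbc_package])
    return ["spark-submit", "--packages", packages, script]


# Hypothetical script name, printed as a single shell command
print(" ".join(build_submit_command("streaming_job.py")))
```

Alternatively, download the jars by hand and pass them with `--jars`; the important part is that the Scala and Spark versions in the Kafka artifact match your installation.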
The code looks like this:
    from pyspark import SparkContext
    from pyspark.sql import Row, SparkSession
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils


    # Note: the JDBC driver class used below must match the driver jar you supply.
    def getSparkSessionInstance(sparkConf):
        # Lazily create a single SparkSession and reuse it for every batch
        if 'sparkSessionSingletonInstance' not in globals():
            globals()['sparkSessionSingletonInstance'] = SparkSession \
                .builder \
                .config(conf=sparkConf) \
                .getOrCreate()
        return globals()['sparkSessionSingletonInstance']


    if __name__ == "__main__":
        # MySQL config
        url = "jdbc:mysql://your_server:3306/spark_test"
        table_name = "word_info"
        username = "root"
        password = "root"

        # Spark context init, with a batch interval of para_seconds seconds
        para_seconds = 10
        sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
        ssc = StreamingContext(sc, para_seconds)

        # Kafka broker and topic
        brokers = 'kafka1:9092'
        topic = 'two-two-para'

        # Read streaming data from Kafka; each record is a (key, value) pair
        kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
        lines = kvs.map(lambda x: x[1])

        # Convert each RDD of the DStream to a DataFrame and run a SQL query
        def process(time, rdd):
            print("========= %s =========" % str(time))
            if rdd.isEmpty():
                return
            try:
                # Get the singleton instance of SparkSession
                spark = getSparkSessionInstance(rdd.context.getConf())

                # Convert RDD[String] to RDD[Row] to DataFrame
                rowRdd = rdd.map(lambda w: Row(word=w))
                wordsDataFrame = spark.createDataFrame(rowRdd)

                # Create a temporary view using the DataFrame
                wordsDataFrame.createOrReplaceTempView("words")

                # Do the word count on the view using SQL and print it
                wordCountsDataFrame = \
                    spark.sql("select word, count(*) as word_count from words group by word")
                wordCountsDataFrame.show()

                # Append this batch's counts to the MySQL table over JDBC
                wordCountsDataFrame.write \
                    .format("jdbc") \
                    .option("url", url) \
                    .option("driver", "org.mariadb.jdbc.Driver") \
                    .option("dbtable", table_name) \
                    .option("user", username) \
                    .option("password", password) \
                    .mode("append") \
                    .save()
            except Exception as e:
                print("An error occurred while processing this batch:")
                print(e)

        lines.foreachRDD(process)

        # Start the job and block until it is stopped
        ssc.start()
        ssc.awaitTermination()
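The code above only counts words, while the question asks about scoring customers with a fixed model. A rough sketch of that step is shown below, under several assumptions: each Kafka message is a JSON string with hypothetical `customer_id` and `features` fields, and `ThresholdModel` is a made-up stand-in for whatever pre-trained model you actually load (for example with `pickle`). The model is loaded once per partition rather than once per record.

```python
import json


class ThresholdModel(object):
    """Hypothetical stand-in for a pre-trained model with a predict() method."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        # Toy rule: label 1 when the feature sum exceeds the threshold
        return 1 if sum(features) > self.threshold else 0


def load_model():
    # Assumption: replace this with loading your real model from disk
    return ThresholdModel(threshold=1.0)


def predict_partition(messages):
    """Score every JSON message in one partition, loading the model only once."""
    model = load_model()
    for message in messages:
        record = json.loads(message)
        yield (record["customer_id"], model.predict(record["features"]))


# Local check with two hand-written messages (no Spark needed)
sample = [
    '{"customer_id": 7, "features": [0.4, 0.9]}',
    '{"customer_id": 8, "features": [0.1, 0.2]}',
]
print(list(predict_partition(sample)))
```

In the streaming job this could be wired in as `lines.mapPartitions(predict_partition)`, and the resulting DStream of `(customer_id, prediction)` pairs written out in a `foreachRDD` callback in the same way `process` writes the word counts.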