How do I read real-time updated data from PostgreSQL (or MySQL) with Spark Streaming (or Kafka)?

Time: 2018-05-22 23:47:46

Tags: java python database apache-kafka spark-streaming

I get real-time updated data from PostgreSQL and want to feed that live data into a fixed model, via Spark Streaming or Kafka, to make predictions about customers.

Please recommend any blogs that work well, exact code, or any information/suggestions you know of. Getting real-time updated data from PostgreSQL/MySQL into a Python/Java environment would also be fine! Thanks!

Or is this perhaps not achievable at all?

1 Answer:

Answer 0: (score: 0)

Here is my example; I hope it helps you.

My Spark version is 2.2.0 and the programming language is Python.

The data flows from Kafka to MySQL, and the Kafka version is 0.9.

Note: you must use the correct jars for MySQL and Kafka; you can find them on the official websites.
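
For example (a sketch only, not from the original answer): if the cluster runs Spark 2.2.0 built for Scala 2.11 and the job uses the MariaDB JDBC driver class as in the code below, one way to supply both jars at submit time looks roughly like this; the exact artifact versions, the jar path, and the script name are assumptions:

spark-submit \
    --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
    --jars /path/to/mariadb-java-client.jar \
    your_streaming_job.py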

The code looks like this:

from pyspark import SparkContext, Row
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# note: the JDBC driver jar must be on the classpath and match the driver class used below

def getSparkSessionInstance(sparkConf):
    if ('sparkSessionSingletonInstance' not in globals()):
        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=sparkConf)\
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']

if __name__ == "__main__":

    # mysql config
    url = "jdbc:mysql://your_server:3306/spark_test"
    table_name = "word_info"
    username = "root"
    password = "root"

    # spark context init
    para_seconds = 10
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, para_seconds)

    # receiver in kafka
    brokers = 'kafka1:9092'
    topic = 'two-two-para'

    # get streaming datas from kafka
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

    lines = kvs.map(lambda x: x[1])

    # Convert RDDs of the words DStream to DataFrame and run SQL query
    def process(time, rdd):
        print("========= %s =========" % str(time))

        if (rdd.isEmpty()):
            return

        try:
            # Get the singleton instance of SparkSession
            spark = getSparkSessionInstance(rdd.context.getConf())

            # Convert RDD[String] to RDD[Row] to DataFrame
            rowRdd = rdd.map(lambda w: Row(word=w))
            wordsDataFrame = spark.createDataFrame(rowRdd)

            # Creates a temporary view using the DataFrame.
            wordsDataFrame.createOrReplaceTempView("words")

            # Do word count on table using SQL and print it
            wordCountsDataFrame = \
                spark.sql("select word, count(*) as word_count from words group by word")
            wordCountsDataFrame.show()

            # Append the per-batch word counts to the MySQL table over JDBC
            wordCountsDataFrame.write \
                .format("jdbc") \
                .option("url", url) \
                .option("driver", "org.mariadb.jdbc.Driver") \
                .option("dbtable", table_name) \
                .option("user", username) \
                .option("password", password) \
                .save(mode="append")

        except Exception as e:
            print("Some error happen!")
            print(e)

    lines.foreachRDD(process)


    # start job
    ssc.start()
    ssc.awaitTermination()
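
The code above covers the Kafka -> Spark Streaming -> MySQL leg. For the PostgreSQL/MySQL -> Kafka leg asked about in the question, one simple (not the only) approach is to poll the source table and publish new rows to the same Kafka topic. The sketch below is an assumption-laden illustration, not part of the original answer: it assumes a source table named word_events with an auto-incrementing id column and a word column, the same broker and topic names as the streaming job, and the kafka-python and psycopg2 packages.

import time

import psycopg2
from kafka import KafkaProducer

# send plain word strings so the streaming job above can consume them as-is
producer = KafkaProducer(
    bootstrap_servers="kafka1:9092",
    value_serializer=lambda v: v.encode("utf-8"),
)

conn = psycopg2.connect(host="your_server", dbname="spark_test",
                        user="root", password="root")
conn.autocommit = True  # each poll should see rows committed since the last one

last_id = 0
while True:
    with conn.cursor() as cur:
        # fetch only the rows inserted since the last poll (assumed schema)
        cur.execute("SELECT id, word FROM word_events WHERE id > %s ORDER BY id",
                    (last_id,))
        for row_id, word in cur.fetchall():
            producer.send("two-two-para", word)
            last_id = row_id
    producer.flush()
    time.sleep(5)  # poll interval in seconds

In production, a change-data-capture tool (for example Debezium) can replace this polling loop, but the polling version is enough to see data flowing end to end.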