Can data be loaded in Apache Spark RDD/Dataframe on the fly?

Asked: 2015-09-01 21:16:36

Tags: apache-spark

Can data be loaded on the fly, or does it have to be pre-loaded into the RDD/DataFrame?

Say I have a SQL database and I use the JDBC source to load 1,000,000 records into the RDD. If, for example, a new record comes into the DB, can I write a job that adds that one new record to the RDD/DataFrame to make it 1,000,001? Or does the entire RDD/DataFrame have to be rebuilt?

1 Answer:

Answer 0 (score: 1)

I guess it depends on what you mean by adding (...) records and rebuilding. RDDs can be combined using SparkContext.union or RDD.union, and DataFrames can be combined with DataFrame.unionAll.

As long as the combined RDDs use the same serializer, no reserialization is needed; but if both use the same partitioner, repartitioning will be required.
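The union operations mentioned above can be sketched as follows. This is a minimal local-mode illustration (the `local[*]` master and sample data are assumptions, not part of the original question); `union` is lazy and simply concatenates the partitions of its parents without shuffling.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local-mode context, for illustration only.
val conf = new SparkConf().setMaster("local[*]").setAppName("union-demo")
val sc = new SparkContext(conf)

// Simulate an initial batch and a batch of newly arrived records.
val batch1 = sc.parallelize(Seq(1, 2, 3))
val batch2 = sc.parallelize(Seq(4, 5))

// RDD.union concatenates the two lineages; no shuffle is performed.
val combined = batch1.union(batch2)

// SparkContext.union does the same for an arbitrary sequence of RDDs.
val combinedAll = sc.union(Seq(batch1, batch2))

sc.stop()
```

The same pattern applies to DataFrames via `unionAll`, as the JDBC example below shows.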

Using a JDBC source as an example:

import org.apache.spark.sql.functions.{max, lit}

val pMap = Map("url" -> "jdbc:..", "dbtable" -> "test")

// Load first batch
val df1 = sqlContext.load("jdbc", pMap).cache

// Get max id and trigger cache
val maxId = df1.select(max($"id")).first().getInt(0)

// Some inserts here...

// Get new records
val dfDiff = sqlContext.load("jdbc", pMap).where($"id" > lit(maxId))

// Combine - only dfDiff has to be fetched
// Should be cached as before
df1.unionAll(dfDiff)

If you need an updatable data structure, IndexedRDD implements a key-value store on top of Spark.