Can data be loaded on the fly, or does it have to be pre-loaded into the RDD/DataFrame?
Say I have a SQL database and I use the JDBC source to load 1,000,000 records into an RDD. If, for example, a new record arrives in the DB, can I write a job that adds that 1 new record to the RDD/DataFrame to make it 1,000,001? Or does the entire RDD/DataFrame have to be rebuilt?
Answer 0 (score: 1)
I guess it depends on what you mean by adding (...) records and by rebuilding. RDDs can be combined using SparkContext.union or RDD.union, and DataFrames can be combined using DataFrame.unionAll.
As long as the RDDs being combined use the same serializer, no reserialization is needed, but if both use the same partitioner, a repartitioning will be required.
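For reference, a minimal sketch of the three combine methods side by side, assuming a Spark 1.x shell where sc and sqlContext are already available and using toy in-memory data:

val rdd1 = sc.parallelize(Seq(1, 2, 3))
val rdd2 = sc.parallelize(Seq(4, 5))
val viaRdd = rdd1.union(rdd2)                 // RDD.union
val viaCtx = sc.union(Seq(rdd1, rdd2))        // SparkContext.union
val dfA = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "v")
val dfB = sqlContext.createDataFrame(Seq((3, "c"))).toDF("id", "v")
val viaDf = dfA.unionAll(dfB)                 // DataFrame.unionAll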
Using the JDBC source as an example:
import org.apache.spark.sql.functions.{max, lit}
import sqlContext.implicits._  // for the $"col" column syntax outside the shell
val pMap = Map("url" -> "jdbc:..", "dbtable" -> "test")
// Load first batch
val df1 = sqlContext.load("jdbc", pMap).cache
// Get max id and trigger cache
val maxId = df1.select(max($"id")).first().getInt(0)
// Some inserts here...
// Get new records
val dfDiff = sqlContext.load("jdbc", pMap).where($"id" > lit(maxId))
// Combine - only dfDiff has to be fetched
// Should be cached as before
df1.unionAll(dfDiff)
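The same pattern can be repeated as further inserts arrive; a hypothetical continuation, reusing pMap, df1, and dfDiff from above and the same Spark 1.x API:

// Keep the combined view cached so only new rows are fetched on each refresh
var combined = df1.unionAll(dfDiff).cache()
var lastId = combined.select(max($"id")).first().getInt(0)

// Later, after more inserts have landed in the database:
val nextDiff = sqlContext.load("jdbc", pMap).where($"id" > lit(lastId))
combined = combined.unionAll(nextDiff).cache()
lastId = combined.select(max($"id")).first().getInt(0)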
If you need an updatable data structure, IndexedRDD implements a key-value store on top of Spark.
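A rough sketch of what that looks like, based on the amplab/spark-indexedrdd README (the exact package name and API may differ between versions):

import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

// Build an IndexedRDD from an existing pair RDD with Long keys
val pairs = sc.parallelize((1L to 1000000L).map(id => (id, "row-" + id)))
val indexed = IndexedRDD(pairs).cache()

// Point insert and lookup; the original IndexedRDD stays unmodified
val updated = indexed.put(1000001L, "new row").cache()
updated.get(1000001L)  // Some("new row")
indexed.get(1000001L)  // None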