There are ways to save the data of an org.apache.spark.sql.DataFrame to a file system or to Hive. But how do I save a DataFrame that was created on top of MongoDB data back to MongoDB?
Edit: I created the DataFrame with:

SparkContext sc = new SparkContext();
Configuration config = new Configuration();
config.set("mongo.input.uri","mongodb://localhost:27017:testDB.testCollection);
JavaRDD<Tuple2<Object, BSONObject>> mongoJavaRDD = sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class,
BSONObject.class).toJavaRDD();
JavaRDD<Object> mongoRDD = mongoJavaRDD.flatMap(new FlatMapFunction<Tuple2<Object, BSONObject>, Object>()
{
@Override
public Iterable<Object> call(Tuple2<Object, BSONObject> arg)
{
BSONObject obj = arg._2();
Object javaObject = generateJavaObjectFromBSON(obj, clazz);
return Arrays.asList(javaObject);
}
});
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.createDataFrame(mongoRDD, Person.class);
df.registerTempTable("Person");
Answer (score: 3)
Using PySpark, and assuming you have a local MongoDB instance:
import pymongo
from toolz import dissoc
# First, let's create a dummy collection
client = pymongo.MongoClient()
client["foo"]["bar"].insert([{"k": "foo", "v": 1}, {"k": "bar", "v": 2}])
client.close()
config = {
    "mongo.input.uri": "mongodb://localhost:27017/foo.bar",
    "mongo.output.uri": "mongodb://localhost:27017/foo.barplus"
}

# Read data from MongoDB
rdd = sc.newAPIHadoopRDD(
    "com.mongodb.hadoop.MongoInputFormat",
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.MapWritable",
    None, None, config)
# Drop the _id field and create a DataFrame
dt = sqlContext.createDataFrame(rdd.map(lambda kv: dissoc(kv[1], "_id")))
dt_plus_one = dt.select(dt["k"], (dt["v"] + 1).alias("v"))
(dt_plus_one
    .rdd                                       # Extract the underlying RDD
    .map(lambda row: (None, row.asDict()))     # Map to (None, dict) pairs
    .saveAsNewAPIHadoopFile(
        "file:///placeholder",                 # Required but ignored by MongoOutputFormat
        # From org.mongodb.mongo-hadoop:mongo-hadoop-core
        "com.mongodb.hadoop.MongoOutputFormat",
        None, None, None, None, config))
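
Since the question uses the Java API, here is a rough equivalent of the write-back step. This is a minimal sketch, not a tested solution: it assumes Spark 1.4+ (for Row.getAs(String)), mongo-hadoop-core on the classpath (MongoOutputFormat from com.mongodb.hadoop, BSONObject/BasicBSONObject from org.bson), the df DataFrame built above, and hypothetical Person fields "name" and "age"; the output URI and collection name are placeholders.

// Minimal sketch (assumptions noted above): convert each Row back to a BSON document
// and write it with MongoOutputFormat.
Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri", "mongodb://localhost:27017/testDB.outCollection"); // assumed target

JavaPairRDD<Object, BSONObject> documents = df.toJavaRDD().mapToPair(
    new PairFunction<Row, Object, BSONObject>()
    {
        @Override
        public Tuple2<Object, BSONObject> call(Row row)
        {
            BSONObject doc = new BasicBSONObject();
            doc.put("name", row.getAs("name")); // hypothetical Person fields
            doc.put("age", row.getAs("age"));
            return new Tuple2<Object, BSONObject>(null, doc);
        }
    });

documents.saveAsNewAPIHadoopFile(
    "file:///not-used",                 // required but ignored by MongoOutputFormat
    Object.class,
    BSONObject.class,
    MongoOutputFormat.class,
    outputConfig);

As with the PySpark version, the write goes through mongo-hadoop's Hadoop output format, so the target database and collection come entirely from mongo.output.uri rather than the file path.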