Is there an elegant, simple, and fast way to move data from Hive into MongoDB?
Answer 0 (score: 2)
You can use the Hadoop-MongoDB connector for the export. Just run the Hive query in the job's main method; the Mapper then picks up that output and inserts the data into MongoDB.
Example:
Here I insert a semicolon-separated text file (id;firstname;lastname) into a MongoDB collection using a simple Hive query:
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class HiveToMongo extends Configured implements Tool {

    private static class HiveToMongoMapper extends
            Mapper<LongWritable, Text, IntWritable, BSONWritable> {

        // Hive writes '\001' (Ctrl-A) as the column delimiter when exporting to a directory.
        // See: https://issues.apache.org/jira/browse/HIVE-634
        private static final String HIVE_EXPORT_DELIMITER = '\001' + "";

        private IntWritable k = new IntWritable();
        private BSONWritable v = null;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is one exported Hive row: id, firstname, lastname
            String[] split = value.toString().split(HIVE_EXPORT_DELIMITER);
            k.set(Integer.parseInt(split[0]));
            v = new BSONWritable();
            v.put("firstname", split[1]);
            v.put("lastname", split[2]);
            context.write(k, v);
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        }
        catch (ClassNotFoundException e) {
            System.out.println("Unable to load Hive Driver");
            System.exit(1);
        }

        try {
            // Export the Hive table to a staging directory on HDFS first.
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default");
            Statement stmt = con.createStatement();
            String sql = "INSERT OVERWRITE DIRECTORY " +
                    "'hdfs://localhost:8020/user/hive/tmp' select * from users";
            stmt.executeQuery(sql);
        }
        catch (SQLException e) {
            System.exit(1);
        }

        int res = ToolRunner.run(new Configuration(), new HiveToMongo(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Path inputPath = new Path("/user/hive/tmp");
        String mongoDbPath = "mongodb://127.0.0.1:6900/mongo_users.mycoll";
        MongoConfigUtil.setOutputURI(conf, mongoDbPath);
        /*
          Add dependencies to the distributed cache via
          DistributedCache.addFileToClassPath(...):
          - mongo-hadoop-core-x.x.x.jar
          - mongo-java-driver-x.x.x.jar
          - hive-jdbc-x.x.x.jar
          HadoopUtils is a utility class of my own, not part of Hadoop.
        */
        HadoopUtils.addDependenciesToDistributedCache("/libs/mongodb", conf);
        HadoopUtils.addDependenciesToDistributedCache("/libs/hive", conf);

        Job job = new Job(conf, "HiveToMongo");
        FileInputFormat.setInputPaths(job, inputPath);
        job.setJarByClass(HiveToMongo.class);
        job.setMapperClass(HiveToMongoMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // Output types match what the mapper emits (map-only job).
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(BSONWritable.class);
        job.setNumReduceTasks(0);
        job.submit();
        System.out.println("Job submitted.");
        return 0;
    }
}
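To sanity-check the result you can read a few documents back with the MongoDB Java driver. This is only a minimal sketch against the old 2.x driver API, assuming the same host, port and namespace as in the job above (with the mongo-hadoop connector the job's key normally ends up as the document's _id, but that depends on the connector version):

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class CheckImport {
    public static void main(String[] args) throws Exception {
        // Connect to the same mongod instance the job wrote to.
        Mongo mongo = new Mongo("127.0.0.1", 6900);
        DBCollection coll = mongo.getDB("mongo_users").getCollection("mycoll");
        System.out.println("documents: " + coll.count());

        // Print the first few imported users.
        DBCursor cursor = coll.find().limit(5);
        while (cursor.hasNext()) {
            DBObject doc = cursor.next();
            System.out.println(doc.get("_id") + " "
                    + doc.get("firstname") + " " + doc.get("lastname"));
        }
        mongo.close();
    }
}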
One drawback is that a 'staging area' (/user/hive/tmp) is needed to store the intermediate Hive output. Also, as far as I know, the Mongo-Hadoop connector doesn't support upserts.
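If the staging directory bothers you, it can be dropped once the MapReduce job has finished. A small sketch using the HDFS FileSystem API; this is a hypothetical helper, not part of the answer above, and it assumes you wait for the job with waitForCompletion() instead of just calling submit():

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: removes the Hive staging directory after the job
// has written everything to MongoDB.
public class StagingCleanup {
    public static void dropStagingDir(Configuration conf, String dir) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path(dir), true); // recursive delete
    }
}

You would call something like dropStagingDir(conf, "/user/hive/tmp") at the end of run(), after the job has completed.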
I'm not quite sure, but you can also try to fetch the data from Hive without running hiveserver (which exposes a Thrift service), so you can save some overhead.
Have a look at the source code of Hive's org.apache.hadoop.hive.cli.CliDriver#processLine(String line, boolean allowInterupting) method, which actually executes the query. Then you can hack something like this together:
...
LogUtils.initHiveLog4j();
CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));
ss.in = System.in;
ss.out = new PrintStream(System.out, true, "UTF-8");
ss.err = new PrintStream(System.err, true, "UTF-8");
SessionState.start(ss);
Driver qp = new Driver();
processLocalCmd("SELECT * from users", qp, ss); //taken from CliDriver
...
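The processLocalCmd above comes from CliDriver, so you would have to copy that code into your own class. A rough alternative, sketched against the old embedded org.apache.hadoop.hive.ql.Driver API (untested, and the exact signatures differ between Hive releases), is to run the query through the Driver directly and drain its results:

import java.util.ArrayList;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class EmbeddedHiveQuery {
    public static void main(String[] args) throws Exception {
        // Start an embedded Hive session, no hiveserver involved.
        SessionState.start(new HiveConf(SessionState.class));
        Driver driver = new Driver();

        int rc = driver.run("SELECT * from users").getResponseCode();
        if (rc != 0) {
            System.err.println("Query failed with return code " + rc);
            return;
        }

        // Fetch the result rows batch by batch; each String is one row.
        ArrayList<String> rows = new ArrayList<String>();
        while (driver.getResults(rows)) {
            for (String row : rows) {
                System.out.println(row);
            }
            rows.clear();
        }
        driver.close();
    }
}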
Side notes:
There is also a hive-mongo connector implementation you might want to check out. If you intend to write a similar connector for MongoDB, the implementation of the Hive-HBase connector is also worth a look, as shown in the sketch below.
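For orientation, this is roughly how the Hive-HBase connector is used from the same Hive JDBC connection the first answer already opens: a table backed by a storage handler that Hive can INSERT into directly, with no staging directory. A MongoDB storage handler would follow the same pattern. This is only an illustration; the table name and HBase column mapping here are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustration of the storage-handler pattern a MongoDB connector would follow;
// table and column-family names are made up for this sketch.
public class StorageHandlerPattern {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default");
        Statement stmt = con.createStatement();

        // Declare a Hive table whose storage is handled by HBase.
        stmt.executeQuery("CREATE TABLE hbase_users(id int, firstname string, lastname string) "
                + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:firstname,d:lastname')");

        // Hive then writes straight into the external store, no staging directory needed.
        stmt.executeQuery("INSERT OVERWRITE TABLE hbase_users SELECT * FROM users");
        con.close();
    }
}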
Answer 1 (score: 1)
Have you looked at Sqoop? It is supposed to make moving data between Hadoop and SQL/NoSQL databases very simple. This article also gives an example of using it with Hive.
Answer 2 (score: 1)
Have a look at the Hadoop-MongoDB connector project:
http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html
"This connectivity takes the form of allowing both the reading of MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as the writing of the results of Hadoop jobs out to MongoDB."
Not sure if it will work for your use case, but it's worth a look.