Question

Is there an elegant, easy, and fast way to move data out of Hive into MongoDB?

Answer 1

You can do the export with the Hadoop-MongoDB connector. Run the Hive query in the main method of your job; the Mapper then uses that output to insert the data into MongoDB.

Example:

Here I insert a semicolon-separated text file (id;firstname;lastname) into a MongoDB collection using a simple Hive query:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.io.BSONWritable;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class HiveToMongo extends Configured implements Tool {

        private static class HiveToMongoMapper extends
                Mapper<LongWritable, Text, IntWritable, BSONWritable> {

            // See: https://issues.apache.org/jira/browse/HIVE-634
            private static final String HIVE_EXPORT_DELIMITER = '\001' + "";
            private IntWritable k = new IntWritable();
            private BSONWritable v = null;

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Each input line is one Hive row: id, firstname, lastname
                String[] split = value.toString().split(HIVE_EXPORT_DELIMITER);

                k.set(Integer.parseInt(split[0]));
                v = new BSONWritable();
                v.put("firstname", split[1]);
                v.put("lastname", split[2]);
                context.write(k, v);
            }
        }

        public static void main(String[] args) throws Exception {
            try {
                Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            } catch (ClassNotFoundException e) {
                System.out.println("Unable to load Hive Driver");
                System.exit(1);
            }

            // Dump the query result into the HDFS staging directory
            try {
                Connection con = DriverManager.getConnection(
                        "jdbc:hive://localhost:10000/default");

                Statement stmt = con.createStatement();
                String sql = "INSERT OVERWRITE DIRECTORY "
                        + "'hdfs://localhost:8020/user/hive/tmp' select * from users";
                stmt.executeQuery(sql);
            } catch (SQLException e) {
                System.out.println("Unable to run Hive query: " + e.getMessage());
                System.exit(1);
            }

            int res = ToolRunner.run(new Configuration(), new HiveToMongo(), args);
            System.exit(res);
        }

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            Path inputPath = new Path("/user/hive/tmp");
            String mongoDbPath = "mongodb://127.0.0.1:6900/mongo_users.mycoll";
            MongoConfigUtil.setOutputURI(conf, mongoDbPath);

            /*
             * Add dependencies to distributed cache via
             * DistributedCache.addFileToClassPath(...):
             * - mongo-hadoop-core-x.x.x.jar
             * - mongo-java-driver-x.x.x.jar
             * - hive-jdbc-x.x.x.jar
             * HadoopUtils is an own utility class
             */
            HadoopUtils.addDependenciesToDistributedCache("/libs/mongodb", conf);
            HadoopUtils.addDependenciesToDistributedCache("/libs/hive", conf);

            Job job = new Job(conf, "HiveToMongo");

            FileInputFormat.setInputPaths(job, inputPath);
            job.setJarByClass(HiveToMongo.class);
            job.setMapperClass(HiveToMongoMapper.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);
            // Output types must match what the mapper emits
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(BSONWritable.class);
            job.setNumReduceTasks(0);

            job.submit();
            System.out.println("Job submitted.");
            return 0;
        }
    }
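The parsing step inside the mapper can be tried out without a cluster. The following is a minimal sketch, assuming the same three-column `users` layout as above; `HiveExportLine` is a hypothetical helper written for illustration and is not part of Hive or mongo-hadoop. It splits one exported row on the Ctrl-A (`\u0001`) delimiter that the mapper also splits on, and rejects rows with an unexpected column count instead of failing later with an `ArrayIndexOutOfBoundsException`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Hive or mongo-hadoop. */
final class HiveExportLine {

    // Ctrl-A, the field delimiter the mapper above splits on.
    private static final String DELIMITER = "\u0001";

    private HiveExportLine() {
    }

    /**
     * Splits one exported row into a document-like map.
     * The column names (id, firstname, lastname) are assumptions
     * taken from the users example in this answer.
     */
    static Map<String, Object> parse(String line) {
        // A limit of -1 keeps trailing empty columns instead of dropping them.
        String[] fields = line.split(Pattern.quote(DELIMITER), -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException(
                    "Expected 3 columns but found " + fields.length);
        }
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", Integer.parseInt(fields[0].trim()));
        doc.put("firstname", fields[1]);
        doc.put("lastname", fields[2]);
        return doc;
    }

    public static void main(String[] args) {
        // prints {id=1, firstname=John, lastname=Doe}
        System.out.println(parse("1\u0001John\u0001Doe"));
    }
}
```

In the real job the same check could live at the top of `map(...)`, where malformed rows might be skipped or counted rather than aborting the task.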

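One disadvantage is that a 'staging area' (/user/hive/tmp) is needed to store the intermediate Hive output. Furthermore, as far as I know, the Mongo-Hadoop connector doesn't support upserts.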

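I'm not quite sure, but you could also try to fetch the data from Hive without running hiveserver (which exposes a Thrift service), so that you save some overhead. Look at the source code of Hive's org.apache.hadoop.hive.cli.CliDriver#processLine(String line, boolean allowInterupting) method, which actually executes the query. Then you can hack together something like this: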

    ...
    LogUtils.initHiveLog4j();
    CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));
    ss.in = System.in;
    ss.out = new PrintStream(System.out, true, "UTF-8");
    ss.err = new PrintStream(System.err, true, "UTF-8");
    SessionState.start(ss);

    Driver qp = new Driver();
    processLocalCmd("SELECT * from users", qp, ss); // taken from CliDriver
    ...

Side notes:

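There is also a hive-mongo connector implementation you might want to check. It is also worth looking at the implementation of the Hive-HBase connector if you want to implement a similar one for MongoDB.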

Answer 2

Have you looked into Sqoop? It is supposed to make it very simple to move data between Hadoop and SQL/NoSQL databases. This article also gives an example of using it with Hive.

Answer 3

Take a look at the hadoop-MongoDB connector project:

http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html

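"This connectivity takes the form of allowing both reading MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as writing the results of Hadoop jobs out to MongoDB."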

Not sure if it will work for your use case, but it's worth looking at.
