If I have billions of records in a Hive table, which of the following approaches is better?
Reading the table directly through HiveContext:
SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("DCA_HIVE_HDFS");
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table(tableName);
df.write().orc(outputHdfsFile);
Reading the table via JDBC:
SparkConf conf = new SparkConf(true).setMaster("yarn-cluster").setAppName("DCA_HIVE_HDFS");
SparkContext sc = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
try {
    // Load the JDBC driver class (e.g. the Hive JDBC driver)
    Class.forName(driverName);
} catch (ClassNotFoundException e) {
    e.printStackTrace();
}
Properties props = new Properties();
props.setProperty("user", userName);
props.setProperty("password", password);
props.setProperty("driver", driverName);
// Without partitioning options, this reads the whole table over a single JDBC connection
DataFrame df = sqlContext.read().jdbc(connectionUri, tableName, props);
df.write().orc(outputHdfsFile);