Could you suggest whether the following approach is correct? I am new to Spark, and I want to insert data into an existing table.
Dataset<Row> logDataFrame = spark.createDataFrame(rowRDD, schema);
if (spark.catalog().tableExists("mylogs")) {
    logDataFrame.write().mode("append").insertInto("mylogs"); // exception
} else {
    logDataFrame.createOrReplaceTempView("mylogs"); // This is working fine
}
Dataset<Row> results = spark.sql("SELECT count(a1) FROM mylogs");
I am getting the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false;;
'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
+- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:352)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:350)
Code edited based on the comments:
Dataset<Row> logDataFrame = sparkSession.createDataFrame(rowRDD, schema);
if (sparkSession.catalog().tableExists("mylogs")) {
    logDataFrame.registerTempTable("temptable");
    sparkSession.sql("insert into table mylogs select * from temptable");
    //logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");
} else {
    logDataFrame.createOrReplaceTempView("mylogs");
}
Dataset<Row> results = sparkSession.sql("SELECT count(a1) FROM mylogs");
This gives the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false;;
'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
+- Project [a1#22, b1#23, c1#24, d1#25]
+- SubqueryAlias temptable
+- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
Answer 0 (score: 0)
First register your dataframe as a temp table with `logDataFrame.registerTempTable("temptable")`, then replace the statement that throws the exception with `sparkSession.sql("insert into table mylogs select * from temptable")`.
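Putting that suggestion together, a minimal sketch (using createOrReplaceTempView, the current replacement for the deprecated registerTempTable; variable names follow the question) would be:

```java
// Sketch: expose the new rows through a temp view, then insert with SQL.
// Assumes "mylogs" already exists as a persistent (e.g. Hive) table --
// the stack traces above show InsertIntoTable over a LogicalRDD, which
// is what happens when the insert target is only a temp view, and that
// is what raises the AnalysisException.
Dataset<Row> logDataFrame = sparkSession.createDataFrame(rowRDD, schema);
logDataFrame.createOrReplaceTempView("temptable");
sparkSession.sql("insert into table mylogs select * from temptable");
```

Note this only changes how the insert is expressed; it does not fix the root cause if `mylogs` itself is a temp view rather than a catalog table.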
Answer 1 (score: 0)
You can create a Spark Dataset from a text file using the SparkSession API.
Based on the data sample you provided in the comments, I created a POJO named `Log`:

public class Log implements Serializable {
    private String col1;
    private String col2;
    private String col3;
    private String col4;
    private String col5;
    private String col6;
    private String col7;
    // getters and setters here
}
Using this, I applied a map over the lines to convert each log line into a `Log` object.
public class LogToDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("Log Job").master("spark://localhost:7077")
                .getOrCreate();

        Dataset<String> textDF = spark.read()
                .text("C:\\Users\\log4jFile.txt")
                .as(Encoders.STRING());

        JavaRDD<Log> logRDD = textDF.toJavaRDD().map(line -> {
            String[] data = line.split(" ");
            Log log = new Log();
            log.setCol1(data[0]);
            log.setCol2(data[1]);
            log.setCol3(data[2]);
            log.setCol4(data[3]);
            log.setCol5(data[4]);
            log.setCol6(data[5]);
            log.setCol7(data[6]);
            return log;
        });

        Dataset<Row> logDataset = spark.createDataFrame(logRDD, Log.class);
        logDataset.write().mode(SaveMode.Append).insertInto("hivelogtable");
        logDataset.createOrReplaceTempView("logtable");
        spark.sql("select * from logtable").show();
    }
}
Now you should be able to insert data into the table with `insertInto()` or `saveAsTable()`, as others mentioned in the comments.
Here is the sample data used to test this code:
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]
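As a quick sanity check on the parsing step (plain Java, no Spark needed), splitting one of the sample lines on single spaces yields exactly the seven fields the `Log` POJO expects:

```java
public class SplitCheck {
    public static void main(String[] args) {
        // One line of the sample log data from the answer above.
        String line = "10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]";
        String[] data = line.split(" ");
        System.out.println(data.length); // 7 -> maps to col1..col7
        System.out.println(data[0]);     // 10-16
        System.out.println(data[6]);     // [Coming::512]
    }
}
```

If the real log lines ever contain runs of multiple spaces, `split(" +")` would be the safer pattern; the sample above is single-spaced, so plain `split(" ")` works here.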
Final output: this is what I get when I query the Hive table.
+-----+------------+-----+-----+----+------------+-------------+
| col1| col2| col3| col4|col5| col6| col7|
+-----+------------+-----+-----+----+------------+-------------+
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575| V|Musicplayer:|[Coming::512]|
+-----+------------+-----+-----+----+------------+-------------+
Answer 2 (score: 0)
I hope this helps someone. You need to check whether the 'mylogs' table exists first; otherwise, if we try to write with 'Append' mode directly at runtime, it throws an exception because the 'mylogs' table does not exist. And there is no need to fiddle around with a 'temp' table.
Dataset<Row> logDataFrame = spark.createDataFrame(rowRDD, schema);
if (spark.catalog().tableExists("mylogs")) {
    logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");
} else {
    logDataFrame.createOrReplaceTempView("mylogs");
}
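For completeness, a variant of the else branch above (my own sketch, not part of the answer): creating the table with `saveAsTable()` on the first run makes `mylogs` a persistent table, so `insertInto()` can target it on every later run instead of sometimes hitting a temp view:

```java
// Sketch -- assumes the SparkSession was built with
// SparkSession.builder().enableHiveSupport(), so saveAsTable()
// creates a persistent table that later insertInto() calls can target.
Dataset<Row> logDataFrame = spark.createDataFrame(rowRDD, schema);
if (spark.catalog().tableExists("mylogs")) {
    logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");
} else {
    logDataFrame.write().saveAsTable("mylogs"); // first run: create the table
}
Dataset<Row> results = spark.sql("SELECT count(a1) FROM mylogs");
```

This avoids the situation in the question where the first run registers only a temp view, which `insertInto()` cannot write to.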