Spark SQL 2.3 - DataFrame - SaveMode.Append - issue

Date: 2018-04-24 05:33:57

Tags: apache-spark apache-spark-sql spark-dataframe

Could you advise whether the following approach is correct? I am new to Spark, and I want to insert data into an existing table.

    Dataset<Row> logDataFrame = spark.createDataFrame(rowRDD, schema);

    if (spark.catalog().tableExists("mylogs")) {
      logDataFrame.write().mode("append").insertInto("mylogs");// exception

    } else {
        logDataFrame.createOrReplaceTempView("mylogs"); // This is working fine
    }

    Dataset<Row> results = spark.sql("SELECT count(a1) FROM mylogs");

Getting the following exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false;;
'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
+- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false

    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:352)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:350)

Code edited per the comments:

    Dataset<Row> logDataFrame = sparkSession.createDataFrame(rowRDD, schema);

    if (sparkSession.catalog().tableExists("mylogs")) {
        logDataFrame.registerTempTable("temptable");
        sparkSession.sql("insert into table mylogs select * from temptable");
       //logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");
    } else {
        logDataFrame.createOrReplaceTempView("mylogs");
    }

    Dataset<Row> results = sparkSession.sql("SELECT count(a1) FROM mylogs");

Getting the error below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false;;
'InsertIntoTable LogicalRDD [a1#4, b1#5, c1#6, d1#7], false, false, false
+- Project [a1#22, b1#23, c1#24, d1#25]
   +- SubqueryAlias temptable
      +- LogicalRDD [a1#22, b1#23, c1#24, d1#25], false

    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)

3 Answers:

Answer 0 (score: 0)

First register your DataFrame as a temp table:

    logDataFrame.registerTempTable("temptable")

Then replace the statement that throws the exception with an SQL insert from that temp table:

    sparkSession.sql("insert into table mylogs select * from temptable")

Answer 1 (score: 0)

You can create a Spark Dataset from a text file using the SparkSession API.

Based on the data sample you provided in the comments, I created a POJO named Log:
public class Log implements Serializable{

    private String col1;
    private String col2;
    private String col3;
    private String col4;
    private String col5;
    private String col6;
    private String col7;

    // getters and setters here

}

Using this, I applied a map over the lines to convert each log line into a Log object:

public class LogToDataset {

    public static void main(String[] args) {

        SparkSession spark = SparkSession.builder()
                .appName("Log Job")
                .master("spark://localhost:7077")
                .getOrCreate();

        Dataset<String> textDF = spark.read()
                .text("C:\\Users\\log4jFile.txt")
                .as(Encoders.STRING());

        JavaRDD<Log> logRDD = textDF.toJavaRDD().map(line -> {
            String[] data = line.split(" ");
            Log log = new Log();
            log.setCol1(data[0]);
            log.setCol2(data[1]);
            log.setCol3(data[2]);
            log.setCol4(data[3]);
            log.setCol5(data[4]);
            log.setCol6(data[5]);
            log.setCol7(data[6]);
            return log;
        });

        Dataset<Row> logDataset = spark.createDataFrame(logRDD, Log.class);

        // Append into an existing Hive table
        logDataset.write().mode(SaveMode.Append).insertInto("hivelogtable");

        // Also expose it as a temp view for ad-hoc queries
        logDataset.createOrReplaceTempView("logtable");
        spark.sql("select * from logtable").show();
    }
}

Now you should be able to insert data into the table using insertInto() or saveAsTable(), as others mentioned in the comments.

Here is the sample data used to test this code:

10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512] 
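The mapper above relies on a plain whitespace split lining up with the seven POJO fields. A minimal, Spark-free sketch of that split, using one of the sample lines (the class and method names here are just for illustration):

```java
// Minimal sketch: how line.split(" ") carves one sample log line into the
// seven fields mapped onto col1..col7 of the Log POJO. No Spark needed.
public class LogLineSplitDemo {

    // Same split the Spark mapper applies to each line.
    static String[] splitLine(String line) {
        return line.split(" ");
    }

    public static void main(String[] args) {
        String line = "10-16 14:45:08.117 11342 30575 V Musicplayer: [Coming::512]";
        String[] data = splitLine(line);
        System.out.println(data.length);  // 7 tokens, one per column
        System.out.println(data[0]);      // 10-16         -> col1
        System.out.println(data[1]);      // 14:45:08.117  -> col2
        System.out.println(data[6]);      // [Coming::512] -> col7
    }
}
```

Note this assumes fields never contain embedded spaces; a log message with extra spaces would shift or add tokens and break the col1..col7 mapping.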

Final output:

This is the output when I query the Hive table.

+-----+------------+-----+-----+----+------------+-------------+
| col1|        col2| col3| col4|col5|        col6|         col7|
+-----+------------+-----+-----+----+------------+-------------+
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
|10-16|14:45:08.117|11342|30575|   V|Musicplayer:|[Coming::512]|
+-----+------------+-----+-----+----+------------+-------------+

Answer 2 (score: 0)

I hope this helps someone. You need to check whether the 'mylogs' table exists first; otherwise, if you try to write with SaveMode.Append directly at runtime, it throws an exception because the 'mylogs' table does not exist yet.

And there is no need for the detour through a 'temp' table.

    Dataset<Row> logDataFrame = spark.createDataFrame(rowRDD, schema);

    if (spark.catalog().tableExists("mylogs")) {
        logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");
    } else {
        logDataFrame.createOrReplaceTempView("mylogs");
    }
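One caveat with the branch above: on the first run the data only exists as a temp view, which does not survive the session, and insertInto cannot target a temp view. An alternative sketch (untested, and assuming a SparkSession built with enableHiveSupport()) that persists the first batch with saveAsTable so later runs can keep appending to a real table:

```java
// Sketch only: persist the first batch with saveAsTable instead of a temp
// view, so "mylogs" survives the session and later runs can append to it.
// Assumes spark was created with SparkSession.builder().enableHiveSupport().
Dataset<Row> logDataFrame = spark.createDataFrame(rowRDD, schema);

if (spark.catalog().tableExists("mylogs")) {
    // Table already persisted: append rows by position into its columns
    logDataFrame.write().mode(SaveMode.Append).insertInto("mylogs");
} else {
    // First run: create a managed table from the DataFrame's schema
    logDataFrame.write().saveAsTable("mylogs");
}
```

The design difference: insertInto resolves columns by position against an existing table's schema, while saveAsTable creates the table (and its schema) from the DataFrame itself.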