Facing an issue with Spark Structured Streaming

Date: 2019-11-12 05:32:22

Tags: java apache-spark apache-spark-sql bigdata spark-streaming

I have written code that reads a CSV file and prints its contents to the console using Spark Structured Streaming. The code is below -


    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.types.StructType;

    import com.cybernetix.models.BaseDataModel;

    public class ReadCSVJob {

        static List<BaseDataModel> bdmList = new ArrayList<BaseDataModel>();

        public static void main(String[] args) throws Exception {

            SparkSession spark = SparkSession
                    .builder()
                    .config("spark.eventLog.enabled", "false")
                    .config("spark.driver.memory", "2g")
                    .config("spark.executor.memory", "2g")
                    .appName("StructuredStreamingAverage")
                    .master("local")
                    .getOrCreate();

            // StructType.add(...) returns a new StructType rather than mutating
            // in place, so the calls must be chained (or reassigned); otherwise
            // the schema stays empty.
            StructType userSchema = new StructType()
                    .add("name", "string")
                    .add("status", "string")
                    .add("u_startDate", "string")
                    .add("u_lastlogin", "string")
                    .add("u_firstName", "string")
                    .add("u_lastName", "string")
                    .add("u_phone", "string")
                    .add("u_email", "string");

            // Read the CSV file as a streaming source using the schema above.
            Dataset<Row> dataset = spark
                    .readStream()
                    .schema(userSchema)
                    .csv("D:\\user\\sdata\\user-2019-10-03_20.csv");

            // Write each micro-batch to the console and block until the
            // query terminates; without awaitTermination() the driver exits
            // before any batch is printed.
            StreamingQuery query = dataset.writeStream()
                    .format("console")
                    .option("truncate", "false")
                    .start();

            query.awaitTermination();
        }
    }

The call add("name", "string") on the schema is where the program terminates. Below is the log trace.

    ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.5.3
    ANTLR Runtime version 4.7 used for parser compilation does not match the current runtime version 4.5.3
    Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:84)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:39)
        at org.apache.spark.sql.types.StructType.add(StructType.scala:213)
        at com.cybernetix.sparks.jobs.ReadCSVJob.main(ReadCSVJob.java:45)
    Caused by: java.lang.UnsupportedOperationException: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
        at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153)
        at org.apache.spark.sql.catalyst.parser.SqlBaseLexer.<clinit>(SqlBaseLexer.java:1175)
        ... 4 more
    Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
        ... 6 more

I have added the ANTLR Maven dependency to my pom.xml, but I still face the same problem.

    <!-- https://mvnrepository.com/artifact/org.antlr/antlr4 -->
    <dependency>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4</artifactId>
        <version>4.7</version>
    </dependency>

I am not sure why, after adding the antlr4 dependency, antlr-runtime-4.5.3.jar still shows up in the Maven dependency list. See the screenshot below.

[Screenshot: Maven Dependencies list still showing antlr-runtime-4.5.3.jar]
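For reference, a quick way to trace which dependency pulls in the old runtime (assuming a standard Maven project) is Maven's dependency tree, filtered to the org.antlr group:

    mvn dependency:tree -Dincludes=org.antlr

This lists every path through which an org.antlr artifact is resolved, which shows exactly where antlr4-runtime 4.5.3 comes from.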

Can someone point out what I am doing wrong here?

1 Answer:

Answer 0 (score: 0):

Update your artifactId to antlr4-runtime and try again, doing a clean build (for example, mvn clean install). The org.antlr:antlr4 artifact is the code-generation tool; what Spark's SQL parser needs on the classpath is org.antlr:antlr4-runtime. Declaring antlr4-runtime 4.7 as a direct dependency lets Maven's "nearest wins" mediation pick it over the 4.5.3 version pulled in transitively.

The dependency should look like this:

    <dependency>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-runtime</artifactId>
        <version>4.7</version>
    </dependency>