Apache Spark cannot find class CSVReader

Date: 2016-09-25 07:08:00

Tags: java maven intellij-idea apache-spark

The code I am using to parse a simple CSV file looks like this:

import java.io.StringReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.opencsv.CSVReader;

SparkConf conf = new SparkConf().setMaster("local").setAppName("word_count");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> csv = sc.textFile("/home/user/data.csv");

JavaRDD<String[]> parsed = csv.map(x -> new CSVReader(new StringReader(x)).readNext());
parsed.foreach(x -> System.out.println(x));

However, the Spark job fails with a ClassNotFoundException saying that CSVReader cannot be found. My pom.xml looks like this:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

How can I fix this?

1 Answer:

Answer 0 (score: 1)


If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.

Source: http://spark.apache.org/docs/latest/submitting-applications.html
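
Applied to the pom.xml from the question, the dependency scopes would look roughly like the sketch below (versions copied from the question, not verified beyond that). Spark stays provided because the cluster supplies it at runtime, but opencsv must not be provided, otherwise the assembly build leaves it out of the uber jar and the ClassNotFoundException remains.

<!-- Sketch only: dependency scopes suggested by the quote above. -->
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <!-- supplied by the cluster manager at runtime -->
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
        <!-- default (compile) scope, so it is bundled into the uber jar -->
    </dependency>
</dependencies>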

Maven does not include the dependency JARs when it packages the project into a JAR. To ship the dependency JARs as well, I added the Maven Shade plugin:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <filters>
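            <!-- Strip signature files from dependency jars so the merged
                 uber jar is not rejected with an invalid-signature error -->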
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <finalName>${project.artifactId}-${project.version}</finalName>
    </configuration>
</plugin>  
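
With the plugin in place, running mvn package produces target/${project.artifactId}-${project.version}.jar with the application classes and opencsv bundled inside; that uber jar is the one to pass to spark-submit.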

See also: How to make it easier to deploy my Jar to Spark Cluster in standalone mode?