I am trying to run a Java project that uses Apache Spark. I read data from a CSV file into a Dataset. If I run the code from Eclipse, it works every time. I configured the project to build a single jar with all of its dependencies.
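For context, the read in DocumentsSparkAccess.java presumably looks roughly like this minimal sketch; the SparkSession setup, file path, and options here are assumptions, and only the DataFrameReader.csv call (visible in the stack trace below) is taken from the original project:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-read-sketch")
                .master("local[*]")   // assumption: local mode, as when launching via java -jar
                .getOrCreate();

        // Under java -jar this call throws "Failed to find data source: csv"
        // (DataFrameReader.csv -> DataSource.lookupDataSource in the trace below).
        Dataset<Row> documents = spark.read()
                .option("header", "true")    // assumed option
                .csv("data/documents.csv"); // hypothetical path

        documents.show();
        spark.stop();
    }
}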
If I run the jar file with java -jar ..., this happens:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: csv. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:594)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
at access.DocumentsSparkAccess.getInstance(DocumentsSparkAccess.java:32)
at process.TopicModelCreator.<init>(TopicModelCreator.java:38)
at main.Main.createTopicModel(Main.java:56)
at main.Main.main(Main.java:37)
Caused by: java.lang.ClassNotFoundException: csv.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
The versions I use are Spark 2.3.0 with the Scala 2.11 artifacts, as listed in the dependencies below.
I build the fat jar with the Maven assembly plugin, configured like this:
<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>3.1.0</version>
    <configuration>
        <archive>
            <manifest>
                <mainClass>main.Main</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>
The dependencies include the following:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
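Note that the built-in csv source ships inside spark-sql, which spark-mllib pulls in transitively, so the class is already on the classpath in both runs; declaring spark-sql explicitly (a sketch, not in the original POM) makes that visible but does not by itself fix the fat-jar problem:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
</dependency>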
Answer:
The same problem was already solved for Parquet files; see "Failed to find data source: parquet" when making a fat jar with maven. The same fix applies here.
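Why this happens: Spark resolves the short name "csv" through java.util.ServiceLoader, which reads META-INF/services/org.apache.spark.sql.sources.DataSourceRegister from the classpath. Several Spark artifacts ship a file at that exact path, and the assembly plugin's jar-with-dependencies descriptor lets one copy overwrite the others, so the fat jar loses the registration for csv; Spark then falls back to trying the class csv.DefaultSource, which is the "Caused by" above. Building the fat jar with maven-shade-plugin and its ServicesResourceTransformer merges those service files instead of overwriting them. A minimal sketch of such a configuration (the plugin version here is an assumption, not from the original POM):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.1.0</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <!-- Concatenates META-INF/services files instead of letting them
                         overwrite each other, so DataSourceRegister keeps the csv entry -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                    <!-- Replaces the assembly plugin's <archive><manifest> setting -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>main.Main</mainClass>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>

As a quick workaround that avoids the ServiceLoader lookup entirely, the fully qualified format name also works with the existing jar: spark.read().format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat") loads the class directly by name.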