Writing a DataFrame to an Avro file

Date: 2020-04-14 15:05:32

Tags: pyspark pyspark-sql avro spark-avro

I start the pyspark shell on the server like this:

pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0
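Since the --packages coordinate must match the Spark version that is actually running, a quick check from inside the shell can confirm they agree (a small aside, assuming the standard SparkSession object is available as spark):

print(spark.version)  # spark-avro_2.11:2.4.0 expects a 2.4.x version here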

Then I loaded the DataFrame:

products = spark.read.csv('products.csv',header=True,inferSchema=True)
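An optional sanity check (the read itself is not the step that fails) to confirm the CSV loaded with the expected schema:

products.printSchema()  # column types produced by inferSchema
products.show(5)        # peek at the first few rows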

When I try to save this DataFrame:

products.write.format('avro').save('prods.avro')

I get this traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/readwriter.py", line 532, in save
    self._jwrite.save(path)
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.save.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/util/CaseInsensitiveMap$
    at org.apache.spark.sql.avro.AvroOptions.<init>(AvroOptions.scala:34)
    at org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:115)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$4.apply(InsertIntoHadoopFsRelationCommand.scala:121)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$4.apply(InsertIntoHadoopFsRelationCommand.scala:121)
    at org.apache.spark.sql.execution.datasources.BaseWriterContainer.driverSideSetup(WriterContainer.scala:105)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:140)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 35 more

The java.lang.NoClassDefFoundError for org/apache/spark/sql/catalyst/util/CaseInsensitiveMap$ seems to be where the problem is.
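For reference, a minimal sketch of a version-matched setup; this is an assumption drawn from the traceback, not a confirmed fix. The /usr/hdp/2.5.0.0-1245/spark2 paths and py4j-0.10.1 suggest the cluster is running Spark 2.0, while org.apache.spark:spark-avro_2.11:2.4.0 is built against Spark 2.4, which would explain the missing CaseInsensitiveMap$ class. On Spark 2.x releases before 2.4, Avro support came from the separate Databricks package instead:

pyspark --packages com.databricks:spark-avro_2.11:3.2.0

and then, inside the shell:

products.write.format('com.databricks.spark.avro').save('prods.avro')

Using the fully qualified format name makes the intended data source explicit rather than relying on the 'avro' short alias.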

0 Answers:

No answers yet.