Writing a Spark DataFrame from Azure Databricks to S3 results in java.lang.VerifyError: Bad type on operand stack

Asked: 2019-11-28 09:25:07

Tags: azure hadoop databricks azure-databricks

I am using the following code to save a Spark DataFrame to S3 as a CSV file:

import traceback

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Attached the spark submit command used
# spark-submit --master local[1] --packages org.apache.hadoop:hadoop-aws:2.7.3,
# com.amazonaws:aws-java-sdk-s3:1.11.98 my_file.py

ACCESS_KEY_ID = "xxxxxxxxxx"
SECRET_ACCESS_KEY = "yyyyyyyyyyyyy"
BUCKET_NAME = "zzzz"

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL data source example") \
    .getOrCreate()

df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
df.show()

try:
    spark.conf.set("fs.s3n.awsAccessKeyId", ACCESS_KEY_ID)
    spark.conf.set("fs.s3n.awsSecretAccessKey", SECRET_ACCESS_KEY)
    spark.conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    output_directory = 's3n://' + BUCKET_NAME + '/' + str("azure_dbs")
    df.write.save(output_directory + '_csv', format='csv', header=True, mode="overwrite")
    print("Written successful")
except Exception as exp:
    print("Exception occurred")
    print(exp)
    print(traceback.format_exc())

When I run it from my local system (using spark-submit), it writes to S3 successfully. The spark-submit command used is:

  

spark-submit --master local[1] --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk-s3:1.11.98 my_file.py

But when I run this as a job from an Azure Databricks notebook, with the same packages added as dependencies of the job, I get the following error:


    py4j.protocol.Py4JJavaError: An error occurred while calling o252.save.
    : java.lang.VerifyError: Bad type on operand stack
    Exception Details:
      Location:
        org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V @152: invokevirtual
      Reason:
        Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is not assignable to 'org/jets3t/service/model/StorageObject'
      Current Frame:
        bci: @152
        flags: { }
        locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object' }
        stack: { 'org/jets3t/service/S3Service', 'java/lang/String', 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object', integer }
      Bytecode:
        0x0000000: b200 41b9 0067 0100 9900 36b2 0041 bb00
        0x0000010: 5959 b700 5a12 68b6 005b 2bb6 005b 1269
        0x0000020: b600 5b2c b600 5b12 6ab6 005b 2ab4 0023
        0x0000030: b600 3cb6 005b b600 5cb9 006b 0200 2ab4
        0x0000040: 0011 9900 302a b400 0c2a b400 232b 0101
        0x0000050: 0101 b600 6c4e 2ab4 001c 0994 9e00 162d
        0x0000060: b600 6d2a b400 1c94 9e00 0a2a 2d2c b600
        0x0000070: 6eb1 bb00 2a59 2cb7 002b 4e2d 2ab4 001f
        0x0000080: b600 302a b400 0c2a b400 23b6 003c 2b2a
        0x0000090: b400 23b6 003c 2d03 b600 6f57 a700 0a4e
        0x00000a0: 2a2d 2bb7 0035 b1                      
      Exception Handler Table:
        bci [0, 113] => handler: 159
        bci [114, 156] => handler: 159
      Stackmap Table:
        same_frame(@62)
        same_frame(@114)
        same_locals_1_stack_item_frame(@159,Object[#216])
        same_frame(@166)

        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:342)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:332)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at com.databricks.sql.transaction.tahoe.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:103)
        at com.databricks.sql.transaction.tahoe.DeltaValidation$.validateNonDeltaWrite(DeltaValidation.scala:94)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:261)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:235)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
        at py4j.Gateway.invoke(Gateway.java:295)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:251)
        at java.lang.Thread.run(Thread.java:748)

(When running this as a notebook job from Azure Databricks, I do not create a new spark object as in the local machine scenario; instead, I use the existing spark session provided by Databricks.)

What is the cause of the error? Are any additional packages required when running this from Azure Databricks?

Spark-submit packages included:

  • org.apache.hadoop:hadoop-aws:2.7.3,
  • com.amazonaws:aws-java-sdk-s3:1.11.98

Local machine:
Python 3.6
Spark version 2.4.4, using Scala version 2.11.12

Databricks details:
Cluster info:
5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
Python 3 (3.5)

1 answer:

Answer 0 (score: 0)

In Azure Databricks, it seems we need to update the keys used to set the configuration. Please refer to the answer given by Carlos David Peña.


We need to use the key "spark.hadoop.fs.s3n.impl" instead of "fs.s3n.impl".
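
For reference, a minimal sketch of this change applied to the original write snippet, assuming the existing spark session provided by the Databricks notebook and the same df DataFrame from the question (credentials and bucket name are placeholders):

    # Placeholders from the question; replace with real values or a secret scope.
    ACCESS_KEY_ID = "xxxxxxxxxx"
    SECRET_ACCESS_KEY = "yyyyyyyyyyyyy"
    BUCKET_NAME = "zzzz"

    # Prefix the Hadoop options with "spark.hadoop." so they reach the
    # Hadoop configuration used by the s3n filesystem, per the answer above.
    spark.conf.set("spark.hadoop.fs.s3n.awsAccessKeyId", ACCESS_KEY_ID)
    spark.conf.set("spark.hadoop.fs.s3n.awsSecretAccessKey", SECRET_ACCESS_KEY)
    spark.conf.set("spark.hadoop.fs.s3n.impl",
                   "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    # Same write as before; only the configuration keys changed.
    output_directory = 's3n://' + BUCKET_NAME + '/azure_dbs'
    df.write.save(output_directory + '_csv', format='csv', header=True, mode="overwrite")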

Note: there is no need to explicitly add any dependent libraries to the job (Azure Databricks).