I am trying to write every record of a Hive table to DynamoDB from a Spark job. The detailed error is:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 2.0 failed 4 times, most recent failure: Lost task 12.3 in stage 2.0 (TID 775, ip-10-0-0-xx.eu-west-1.compute.internal, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
The code is as follows:
import java.util.Random

import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.regions.Regions
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.document.{DynamoDB, Item}
import org.apache.spark.sql.SparkSession

object ObjName {
  def main(args: Array[String]): Unit = {
    // Print the jar that AmazonDynamoDBClientBuilder was loaded from (runs on the driver).
    print(classOf[com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder]
      .getProtectionDomain().getCodeSource().getLocation().toURI().getPath())

    val session = SparkSession.builder()
      .appName("app_name")
      .enableHiveSupport()
      .getOrCreate()
    import session.implicits._
    session.sparkContext.setLogLevel("WARN")

    session.sql("""
        select
          email,
          name
        from db.tbl
      """).rdd.repartition(40)
      .foreachPartition(iter => {
        // Runs on the executors: one DynamoDB client per partition.
        val random = new Random()
        val client = AmazonDynamoDBClientBuilder.standard
          .withRegion(Regions.EU_WEST_1)
          .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("access key", "secret key")))
          .build()
        val dynamoDB = new DynamoDB(client)
        val table = dynamoDB.getTable("table_name")
        iter.foreach(row => {
          val item = new Item()
            .withPrimaryKey("email", row.getString(0))
            .withNumber("ts", System.currentTimeMillis * 1000 + random.nextInt(999 + 1))
            .withString("name", row.getString(1))
          table.putItem(item)
        })
      })
  }
}
Maven dependencies:
<dependencies>
    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-dynamodb -->
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-dynamodb</artifactId>
        <version>1.11.170</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-core</artifactId>
        <version>1.11.170</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.11.170</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>jmespath-java</artifactId>
        <version>1.11.170</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.4.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.0</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.10 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.1.0</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
At the beginning of the main method I print the location of the jar that the class com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder was loaded from, and it prints successfully, which means the class loads fine on the driver node.
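To check the same thing on the executors rather than only on the driver, here is a small debugging sketch I could run inside the job (my own idea, reusing the session built above); Class.forName forces static initialization, so any failure is reported per task instead of aborting the stage:

// Debugging sketch (assumption: `session` is the SparkSession created above).
// Each task tries to load and initialize the builder class and reports where it came from.
val checks = session.sparkContext.parallelize(1 to 20, 20).map { _ =>
  try {
    val cls = Class.forName("com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder")
    val src = cls.getProtectionDomain.getCodeSource
    "OK: " + (if (src == null) "<unknown code source>" else src.getLocation.toString)
  } catch {
    case t: Throwable => "FAILED: " + t.getClass.getName + ": " + t.getMessage
  }
}.collect().distinct
checks.foreach(println)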
I also ran jar tvf package.jar | grep -i AmazonDynamoDBClientBuilder --color
and confirmed that the class is present in my packaged jar file.
The command used to submit the Spark job is shown below. With or without --jars,
it fails with the same error shown above. Any suggestions? Thanks.
spark-submit --class MainClassName --jars /mnt/home/hadoop/aws-java-sdk-dynamodb-1.11.170.jar,/mnt/home/hadoop/aws-java-sdk-core-1.11.170.jar,/mnt/home/hadoop/aws-java-sdk-s3-1.11.170.jar,/mnt/home/hadoop/jmespath-java-1.11.170.jar --driver-memory 3G --num-executors 20 --executor-memory 4G --executor-cores 4 package.jar
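For completeness, a variant of the submit command I could also try (untested sketch; it assumes the SDK jars exist at the same /mnt/home/hadoop/ paths on every node), which puts the jars on the executor classpath explicitly in addition to distributing them with --jars:

spark-submit --class MainClassName \
  --jars /mnt/home/hadoop/aws-java-sdk-dynamodb-1.11.170.jar,/mnt/home/hadoop/aws-java-sdk-core-1.11.170.jar,/mnt/home/hadoop/aws-java-sdk-s3-1.11.170.jar,/mnt/home/hadoop/jmespath-java-1.11.170.jar \
  --conf spark.executor.extraClassPath=/mnt/home/hadoop/aws-java-sdk-dynamodb-1.11.170.jar:/mnt/home/hadoop/aws-java-sdk-core-1.11.170.jar:/mnt/home/hadoop/aws-java-sdk-s3-1.11.170.jar:/mnt/home/hadoop/jmespath-java-1.11.170.jar \
  --driver-memory 3G --num-executors 20 --executor-memory 4G --executor-cores 4 \
  package.jar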