I have Hadoop and Spark installed on my local machine. I tried connecting to AWS S3 and got that working, using hadoop-aws-2.8.0.jar. However, I have been trying to connect to DynamoDB using the jar provided by EMR, emr-ddb-hadoop.jar. I have installed all the AWS dependencies and they are available locally, but I keep getting the following exception:
java.lang.ClassCastException: org.apache.hadoop.dynamodb.read.DynamoDBInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat
Here is my code snippet.
import sys
import os
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = "/usr/local/Cellar/spark"

# Pass the EMR DynamoDB connector and the AWS Java SDK jars to the pyspark shell.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/tools/lib/emr-ddb-hadoop.jar,' \
                                    '/home/aws-java-sdk/1.11.201/lib/aws-java-sdk-1.11.201.jar pyspark-shell'
sys.path.append("/usr/local/Cellar/spark/python")
sys.path.append("/usr/local/Cellar/spark/python")
sys.path.append("/usr/local/Cellar/spark/python/lib/py4j-0.10.4-src.zip")
try:
    from pyspark.sql import SparkSession, SQLContext, Row
    from pyspark import SparkConf, SparkContext
    from pyspark.sql.window import Window
    import pyspark.sql.functions as func
    from pyspark.sql.functions import lit, lag, col, udf
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DoubleType, TimestampType, LongType
except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)
spark = SparkSession \
    .builder \
    .master("spark://xxx.local:7077") \
    .appName("Sample") \
    .getOrCreate()
sc = spark.sparkContext
conf = {"dynamodb.servicename": "dynamodb", \
"dynamodb.input.tableName": "test-table", \
"dynamodb.endpoint": "http://dynamodb.us-east-1.amazonaws.com/", \
"dynamodb.regionid": "us-east-1", \
"mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat"}
dynamo_rdd = sc.newAPIHadoopRDD('org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
                                'org.apache.hadoop.io.Text',
                                'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
                                conf=conf)
dynamo_rdd.collect()
Answer 0 (score: 0)
I did not use newAPIHadoopRDD; with the old hadoopRDD API it worked without problems. (The ClassCastException occurs because DynamoDBInputFormat implements the old org.apache.hadoop.mapred.InputFormat interface rather than the org.apache.hadoop.mapreduce.InputFormat that newAPIHadoopRDD expects.)
Here is the working sample I followed; a minimal sketch of the old-API call is shown after the link:
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
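A minimal sketch of what the old-API read looks like (an illustration based on the linked post, not the exact code I ran; it assumes the same SparkContext sc, jar setup, and conf keys as in the question above):

# Sketch only: assumes the same jar setup and SparkContext (sc) as in the question.
# DynamoDBInputFormat is an old-style (mapred) InputFormat, so sc.hadoopRDD is used
# instead of sc.newAPIHadoopRDD.
conf = {"dynamodb.servicename": "dynamodb",
        "dynamodb.input.tableName": "test-table",
        "dynamodb.endpoint": "http://dynamodb.us-east-1.amazonaws.com/",
        "dynamodb.regionid": "us-east-1",
        "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat"}

dynamo_rdd = sc.hadoopRDD('org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
                          'org.apache.hadoop.io.Text',
                          'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
                          conf=conf)

# Each record is a (key, item) pair; count() forces the read from the table.
print(dynamo_rdd.count())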