Local Spark installation not working as expected

Date: 2017-09-22 23:13:28

Tags: python apache-spark pyspark amazon-dynamodb

I have Hadoop and Spark installed on my local machine. I tried connecting to AWS S3 and was able to do so successfully, using hadoop-aws-2.8.0.jar for that purpose (a sketch of that working S3 access appears after my code snippet below). However, I have been trying to connect to DynamoDB using emr-ddb-hadoop.jar, the jar provided by EMR. I have installed all the AWS dependencies and they are available locally, but I keep getting the following exception:

java.lang.ClassCastException: org.apache.hadoop.dynamodb.read.DynamoDBInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat

Here is my code snippet:

import sys
import os

if 'SPARK_HOME' not in os.environ:
  os.environ['SPARK_HOME'] = "/usr/local/Cellar/spark"
  # Ship the EMR DynamoDB connector and the AWS SDK with the job.
  os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/tools/lib/emr-ddb-hadoop.jar,'
    '/home/aws-java-sdk/1.11.201/lib/aws-java-sdk-1.11.201.jar pyspark-shell')
  sys.path.append("/usr/local/Cellar/spark/python")
  sys.path.append("/usr/local/Cellar/spark/python/lib/py4j-0.10.4-src.zip")

try:
  from pyspark.sql import SparkSession, SQLContext, Row
  from pyspark import SparkConf, SparkContext
  from pyspark.sql.window import Window
  import pyspark.sql.functions as func
  from pyspark.sql.functions import lit, lag, col, udf
  from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DoubleType, TimestampType, LongType
except ImportError as e:
  print("error importing spark modules", e)
  sys.exit(1)

spark = SparkSession \
    .builder \
    .master("spark://xxx.local:7077") \
    .appName("Sample") \
    .getOrCreate()
sc = spark.sparkContext
conf = {"dynamodb.servicename": "dynamodb",
        "dynamodb.input.tableName": "test-table",
        "dynamodb.endpoint": "http://dynamodb.us-east-1.amazonaws.com/",
        "dynamodb.regionid": "us-east-1",
        "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat"}
dynamo_rdd = sc.newAPIHadoopRDD('org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
    'org.apache.hadoop.io.Text',
    'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
    conf=conf)
dynamo_rdd.collect()
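
For context, a rough sketch of the kind of S3 access that did work for me (assuming s3a:// paths with credentials set through the Hadoop configuration; the bucket and keys below are placeholders):

# Sketch of the working S3 read, assuming hadoop-aws-2.8.0.jar is on the classpath.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder
df = spark.read.csv("s3a://my-bucket/some-prefix/")      # hypothetical bucket
df.show()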

1 Answer:

Answer 0 (score: 0):

I did not use newAPIHadoopRDD; with the old API (hadoopRDD) it works without problems. DynamoDBInputFormat implements the old org.apache.hadoop.mapred.InputFormat interface rather than the new org.apache.hadoop.mapreduce one, which is exactly what the ClassCastException is complaining about.

Here is the working sample I followed:

https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
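
For reference, a minimal sketch of the old-API call, reusing the jars, classes, and conf dictionary from the question:

# Old (mapred) API entry point: hadoopRDD instead of newAPIHadoopRDD.
dynamo_rdd = sc.hadoopRDD(
    'org.apache.hadoop.dynamodb.read.DynamoDBInputFormat',
    'org.apache.hadoop.io.Text',
    'org.apache.hadoop.dynamodb.DynamoDBItemWritable',
    conf=conf)
print(dynamo_rdd.count())

Everything else (the jars on the classpath and the conf keys) stays the same; only the entry point changes.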