I am trying to connect to an Amazon Kinesis Stream from Google Dataproc, but I am only getting empty RDDs.
Command: spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
Full logs: https://gist.github.com/sshrestha-datalicious/e3fc8ebb4916f27735a97e9fcc42136c
More details:
Spark 1.6.1
Hadoop 2.7.2
Assembly used: /usr/lib/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar
Surprisingly, the setup works when I download and use the assembly containing Spark 1.6.1 built against Hadoop 2.6.0, with the following command:
Command: SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6 spark-submit --verbose --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.2 demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX
I am not sure whether there is a version conflict between the two Hadoop versions and the Kinesis ASL, or whether it is related to Google Dataproc's custom setup.
Any help would be greatly appreciated.
Thanks,
Suren
Answer 0 (score: 2)
Our team found ourselves in the same situation, and we managed to solve it:
We were running in the same environment, with a simple Spark Streaming Kinesis script that boils down to:
# Run the script as
# spark-submit \
# --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1\
# demo_kinesis_streaming.py\
# --awsAccessKeyId FOO\
# --awsSecretKey BAR\
# ...
import argparse
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.storagelevel import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
ap = argparse.ArgumentParser()
ap.add_argument('--awsAccessKeyId', required=True)
ap.add_argument('--awsSecretKey', required=True)
ap.add_argument('--stream_name')
ap.add_argument('--region')
ap.add_argument('--app_name')
ap = ap.parse_args()
kinesis_application_name = ap.app_name
kinesis_stream_name = ap.stream_name
kinesis_region = ap.region
kinesis_endpoint_url = 'https://kinesis.{}.amazonaws.com'.format(ap.region)
spark_context = SparkContext(appName=kinesis_application_name)
streamingContext = StreamingContext(spark_context, 60)  # 60-second batch interval
kinesisStream = KinesisUtils.createStream(
ssc=streamingContext,
kinesisAppName=kinesis_application_name,
streamName=kinesis_stream_name,
endpointUrl=kinesis_endpoint_url,
regionName=kinesis_region,
initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
checkpointInterval=60,
storageLevel=StorageLevel.MEMORY_AND_DISK_2,
awsAccessKeyId=ap.awsAccessKeyId,
awsSecretKey=ap.awsSecretKey
)
kinesisStream.pprint()
streamingContext.start()
streamingContext.awaitTermination()
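As a side note, pprint() still prints a timestamped batch header even when the batch is empty, so the empty-RDD symptom can be made more explicit by also printing a per-batch record count. A minimal sketch (our addition, to be placed before streamingContext.start()):

# Hypothetical diagnostic: print the number of records in each 60-second
# batch alongside the records themselves. A healthy stream shows non-zero
# counts, while the bug described here yields only zeroes.
kinesisStream.count().pprint()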
The code was tested working on AWS EMR and in a local environment with the same Spark 1.6.1 / Hadoop 2.7 setup. On Dataproc we tried running it three ways: submitting the job with the gcloud command; ssh-ing into the cluster master node and running in yarn client mode; and ssh-ing into the master node and running with local[*]. (A sketch of the gcloud invocation follows the log4j config below.) Verbose logging was enabled by updating /etc/spark/conf/log4j.properties with the following values:
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
log4j.logger.org.eclipse.jetty=ERROR
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=DEBUG
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=DEBUG
log4j.logger.org.apache.spark=DEBUG
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=DEBUG
log4j.logger.org.spark-project.jetty.server.handler.ContextHandler=DEBUG
log4j.logger.org.apache=DEBUG
log4j.logger.com.amazonaws=DEBUG
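For reference, the gcloud submission mentioned above looked roughly like the following (the cluster name and the use of --properties to pull in the ASL package are assumptions; adjust to your setup):

Command: gcloud dataproc jobs submit pyspark demo_kinesis_streaming.py --cluster my-cluster --properties spark.jars.packages=org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 -- --awsAccessKeyId XXXXX --awsSecretKey XXXX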
We noticed something odd in the logs (note that spark-streaming-kinesis-asl_2.10:1.6.1 depends on aws-sdk-java/1.9.37, yet somehow aws-sdk-java/1.7.4 was being used, as revealed by the user-agent further down):
16/07/10 06:30:16 DEBUG com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer: PROCESS task encountered execution exception:
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:137)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.consumeShard(ShardConsumer.java:126)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:334)
    at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:174)
Caused by: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:119)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
The signed request headers in the same log reveal the offending user-agent:
content-length:282
content-type:application/x-amz-json-1.1
host:kinesis.ap-southeast-2.amazonaws.com
user-agent:SparkDemo,amazon-kinesis-client-library-java-1.4.0, aws-sdk-java/1.7.4 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.91-b14/1.8.0_91
x-amz-date:20160710T063016Z
x-amz-target:Kinesis_20131202.GetRecords
It seems that Dataproc built its own Spark with a much older AWS SDK as a dependency, and it blows up when combined with code requiring a much newer AWS SDK, though we are not sure exactly which module caused the error.
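A quick way to confirm which AWS SDK version the driver JVM actually loaded is to ask the SDK itself through py4j. A sketch, under the assumption that the SDK's com.amazonaws.util.VersionInfoUtils class is on the classpath (note that sc._jvm is a private PySpark API):

# Hypothetical diagnostic: print the AWS SDK version visible to the driver.
# If Hadoop's bundled SDK wins the classpath race, this prints 1.7.4 rather
# than the 1.9.37 that spark-streaming-kinesis-asl 1.6.1 expects.
from pyspark import SparkContext

sc = SparkContext(appName='aws-sdk-version-check')
print(sc._jvm.com.amazonaws.util.VersionInfoUtils.getVersion())
sc.stop()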
UPDATE: Per @DennisHuo's comment, this behavior is caused by Hadoop's leaky classpath: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-project/pom.xml#L650
To make things worse, the AWS KCL 1.4.0 (used by Spark 1.6.1) will silently suppress any runtime error instead of throwing a RuntimeException, which caused a lot of headaches while debugging.
Eventually, our solution was to build our own org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 with all of its com.amazonaws.* classes shaded.
Build the JAR with the following pom (updating spark/extra/kinesis-asl/pom.xml) and ship the new JAR with the --jars flag of spark-submit:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-parent_2.10</artifactId>
    <version>1.6.1</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>

  <!-- Kinesis integration is not included by default due to ASL-licensed code. -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
  <packaging>jar</packaging>
  <name>Spark Kinesis Integration</name>

  <properties>
    <sbt.project.name>streaming-kinesis-asl</sbt.project.name>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-client</artifactId>
      <version>${aws.kinesis.client.version}</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>amazon-kinesis-producer</artifactId>
      <version>${aws.kinesis.producer.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-core</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalacheck</groupId>
      <artifactId>scalacheck_${scala.binary.version}</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-test-tags_${scala.binary.version}</artifactId>
    </dependency>
  </dependencies>

  <build>
    <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
    <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <configuration>
          <shadedArtifactAttached>false</shadedArtifactAttached>
          <artifactSet>
            <includes>
              <!-- At a minimum we must include this to force effective pom generation -->
              <include>org.spark-project.spark:unused</include>
              <include>com.amazonaws:*</include>
            </includes>
          </artifactSet>
          <relocations>
            <relocation>
              <pattern>com.amazonaws</pattern>
              <shadedPattern>foo.bar.YO.com.amazonaws</shadedPattern>
              <includes>
                <include>com.amazonaws.**</include>
              </includes>
            </relocation>
          </relocations>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
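With that pom in place, the build-and-submit flow would look roughly like this (the module path and JAR name are assumptions based on the Spark 1.6.1 source layout):

Command: mvn -pl extras/kinesis-asl -am package -DskipTests
Command: spark-submit --verbose --jars extras/kinesis-asl/target/spark-streaming-kinesis-asl_2.10-1.6.1.jar demo_kinesis_streaming.py --awsAccessKeyId XXXXX --awsSecretKey XXXX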