I am trying to run a Hadoop job on an EMR cluster. It is run as a Java command where I use a jar-with-dependencies.
The job pulls data out of Teradata, and I assumed the Teradata-related jars were also packaged within the jar-with-dependencies. However, I still get this exception:

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:171)
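For context, the stack trace points at DBInputFormat.setConf, which resolves the configured JDBC driver class by name at runtime; that lookup is what throws the ClassNotFoundException. The sketch below shows the kind of setup involved — the class name, URL, and credentials are made-up placeholders, since my actual job configuration is not shown here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;

public class TeradataJobSetup {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // DBInputFormat.setConf() later calls Class.forName() on this
        // driver string; if the Teradata jar is not visible on the
        // classpath at that point, it fails exactly as above.
        DBConfiguration.configureDB(conf,
                "com.teradata.jdbc.TeraDriver",                    // resolved at runtime by name
                "jdbc:teradata://host.example.com/DATABASE=mydb",  // placeholder URL
                "user", "password");                               // placeholder credentials
        return conf;
    }
}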
My pom has the following relevant dependencies:

<dependency>
  <groupId>teradata</groupId>
  <artifactId>terajdbc4</artifactId>
  <version>14.10.00.17</version>
</dependency>
<dependency>
  <groupId>teradata</groupId>
  <artifactId>tdgssconfig</artifactId>
  <version>14.10.00.17</version>
</dependency>

I am packaging the full jar as follows:
<build>
  <plugins>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.1</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <compilerArgument>-Xlint:-deprecation</compilerArgument>
      </configuration>
    </plugin>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.2.1</version>
      <configuration>
        <descriptors>
        </descriptors>
        <archive>
          <manifest>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
The assembly descriptor (assembly.xml):
<assembly>
  <id>aws-emr</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <unpack>false</unpack>
      <includes>
      </includes>
      <scope>runtime</scope>
      <outputDirectory>lib</outputDirectory>
    </dependencySet>
    <dependencySet>
      <unpack>true</unpack>
      <includes>
        <include>${groupId}:${artifactId}</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
I run the EMR command in the following way:

aws emr create-cluster --release-label emr-5.3.1 \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=5,BidPrice=0.1,InstanceType=m3.xlarge \
--service-role EMR_DefaultRole --log-uri s3://my-bucket/logs \
--applications Name=Hadoop --name TeradataPullerTest \
--ec2-attributes <ec2-attributes> \
--steps Type=CUSTOM_JAR,Name=EventsPuller,Jar=s3://path-to-jar-with-dependencies.jar,\
Args=[com.my.package.EventsPullerMR],ActionOnFailure=TERMINATE_CLUSTER \
--auto-terminate

Is there a way I can specify the Teradata jars so that they are added to the classpath when the map-reduce job executes?

EDIT: I confirmed that the missing class is indeed packaged within the jar-with-dependencies.
Answer 0 (score: 0)
I haven't fully solved this, but I found a way to make it work. The ideal solution would be to package the Teradata jars inside the uber jar. That is in fact happening, yet somehow those jars are not being added to the classpath, and I am not sure why.
I worked around it by creating two separate jars: one containing just my code, and another containing all the required dependencies. I uploaded both jars to S3 and wrote a script that does the following (pseudo-code):
# download the main jar
aws s3 cp <s3-path-to-myjar.jar> .

# download the dependency jar into a temp directory
aws s3 cp <s3-path-to-dependency-jar> temp

# unzip the bundled jars from the dependencies jar into another directory (say `jars`)
unzip -j temp/dependencies.jar <path-within-jar-to-unzip>/* -d jars

# comma-separated list for -libjars (strip the trailing comma left by tr)
LIBJARS=$(find jars/*.jar | tr '\n' ',' | sed 's/,$//')

# the client-side classpath uses colons instead of commas
HADOOP_CLASSPATH=$(echo ${LIBJARS} | sed 's/,/:/g')
CLASSPATH=${HADOOP_CLASSPATH}
export CLASSPATH HADOOP_CLASSPATH

# run via the hadoop command; -libjars ships the listed jars with the job
hadoop jar myjar.jar com.my.package.EventsPullerMR -libjars ${LIBJARS} <arguments to the job>
This launched the job successfully.
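One caveat worth noting: -libjars is interpreted by Hadoop's GenericOptionsParser, so it only takes effect when the main class is run through ToolRunner. A minimal sketch of the shape the entry point needs for this to work — the body of EventsPullerMR below is my assumption, not code from the original job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class EventsPullerMR extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects generic options such as -libjars
        Job job = Job.getInstance(getConf(), "EventsPuller");
        job.setJarByClass(EventsPullerMR.class);
        // ... mapper/reducer and input/output wiring elided ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips -libjars, -D, etc. before calling run()
        System.exit(ToolRunner.run(new Configuration(), new EventsPullerMR(), args));
    }
}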