在YARN模式下使用Spark(1.4.0)在EMR(Hadoop 2.4.0)上通过IPython笔记本运行PySpark:
IPYTHON_OPTS="notebook --no-browser" nohup /usr/lib/spark/bin/pyspark --master yarn-client --num-executors 2 --executor-memory 512m --executor-cores 1 > /mnt/var/log/python_notebook.log 2> /mnt/var/log/python_notebook_err.log &
在HDFS中放置了一个简单的CSV文件,并尝试使用
进行读取sc.textFile('/tmp/text.csv').first()
但是,这给了我 Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
。
在上下文中:
Py4JJavaError Traceback (most recent call last)
<ipython-input-54-e39168c6841b> in <module>()
----> 1 sc.textFile('/tmp/text.csv').first()
/usr/lib/spark/python/pyspark/rdd.py in first(self)
1293 ValueError: RDD is empty
1294 """
-> 1295 rs = self.take(1)
1296 if rs:
1297 return rs[0]
/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
1245 """
1246 items = []
-> 1247 totalParts = self.getNumPartitions()
1248 partsScanned = 0
1249
/usr/lib/spark/python/pyspark/rdd.py in getNumPartitions(self)
353 2
354 """
--> 355 return self._jrdd.partitions().size()
356
357 def filter(self, f):
/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling o159.partitions.
: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:65)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:47)
at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 25 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
... 29 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
... 31 more
我已尝试按照here的说明完成操作:
os.environ['SPARK_LIBRARY_PATH'] = "/usr/lib/hadoop-lzo/lib/native/"
os.environ['SPARK_CLASSPATH'] = "/usr/lib/hadoop-lzo/lib/"
然而,这似乎没有帮助。
答案 0 :(得分:3)
我知道这个问题已经过去了,但过去一周我一直在处理这个问题,所以我想我会发布我们的解决方案,以防其他人遇到这个问题。我们的设置是一个EC2实例作为EMR外部的驱动程序运行,然后可以创建EMR集群并与主服务器通信。群集运行Spark 2.2.0,EMR版本为5.9.0。
解决方案是克隆Spark驱动程序上的Twitter Hadoop-Lzo Github repo,然后将路径添加到hadoop-lzo.jar以激活提交args。 SUBMIT_ARGS='--jars /opt/hadoop-lzo/target/hadoop-lzo-0.4.21-SNAPSHOT.jar
。只需将.jar的路径替换为您将repo克隆到的地址。