For some reason, the following function in my pipeline throws an error when I run the job on EMR (using emr-5.0.0 and Spark 2.0.0):
def autv(self, f=atf):
    """
    Args:
        f: aggregation function applied to each user's grouped (t, w) pairs.

    Returns:
        self on success, or None if an AttributeError is raised.
    """
    if not self._utv:
        raise FileNotFoundError("Data not loaded.")
    ut = self._utv
    try:
        self._utv = (ut
                     .rdd
                     .map(lambda x: (x.id, (x.t, x.w)))
                     .groupByKey()
                     .map(lambda x: Row(id=x[0],
                                        w=len(x[1]),
                                        t=DenseVector(f(x[1]))))
                     .toDF())
        return self
    except AttributeError as e:
        logging.error(e)
        return None
atf is a very simple function:
def atf(iterable):
    """
    Args:
        iterable: pairs of (vector, weight).

    Returns:
        Element-wise mean of the vectors.
    """
    return [stats.mean(t) for t in zip(*list(zip(*iterable))[0])]
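A quick standalone check of what this computes (the sample pairs below are made up for illustration): given an iterable of (vector, weight) pairs, it returns the element-wise mean of the vectors, ignoring the weights:

```python
import statistics as stats

def atf(iterable):
    # zip(*iterable) separates vectors from weights; [0] keeps the vectors,
    # the inner zip transposes them, and we take the mean per dimension.
    return [stats.mean(t) for t in zip(*list(zip(*iterable))[0])]

pairs = [((1.0, 2.0), 5), ((3.0, 4.0), 7)]
print(atf(pairs))  # [2.0, 3.0]
```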
I get a long stack trace, but this is the last part of it:
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'regression'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
16/08/27 16:28:43 INFO ShutdownHookManager: Shutdown hook called
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644/pyspark-41867521-9dfd-4d8f-8b13-33272063e0c3
The ImportError: No module named 'regression' message makes no sense to me, because the rest of my script runs functions from this very module, and when I remove the aggregate_user_topic_vectors function the script runs without errors. Also, as I said before, the script runs without errors on my local machine even with aggregate_user_topic_vectors in place. I have set PYTHONPATH to point at my project. I really don't know where to start; any input would be appreciated.
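One thing worth noting about the error above: PYTHONPATH only affects the driver, while the ImportError is raised when an executor unpickles the lambdas, so the module also has to exist on the worker nodes. A common way to ship it there is via spark-submit (a sketch; the file names here are assumptions based on the error message, not taken from the post):

```shell
# Ship the module (or a zip of the whole package) to every executor.
spark-submit --py-files regression.py my_pipeline.py

# Alternatively, add it at runtime from the driver:
#   spark.sparkContext.addPyFile("regression.py")
```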
Answer 0 (score: 0)
Well, as I suspected, my problem was solved by moving from groupByKey (which is apparently evil) to reduceByKey, so it had nothing to do with how I imported the module. Here is the revised code. Hope this helps someone!
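The revised code itself did not survive in the post. As a rough sketch of the idea (plain Python standing in for the RDD semantics; all names and data here are illustrative, not the author's), reduceByKey merges values pairwise per key instead of materializing each full group the way groupByKey does:

```python
# Each record is (key, (vector, count)); merge accumulates a running
# element-wise sum and count per key, so no full group is ever held at once.
def merge(a, b):
    (sa, ca), (sb, cb) = a, b
    return ([x + y for x, y in zip(sa, sb)], ca + cb)

records = [("u1", ((1.0, 2.0), 1)),
           ("u1", ((3.0, 4.0), 1)),
           ("u2", ((5.0, 6.0), 1))]

acc = {}
for key, (vec, _w) in records:
    acc[key] = merge(acc[key], (list(vec), 1)) if key in acc else (list(vec), 1)

# Finish by dividing each summed vector by its count to get the mean.
means = {k: [s / c for s in vec_sum] for k, (vec_sum, c) in acc.items()}
print(means)  # {'u1': [2.0, 3.0], 'u2': [5.0, 6.0]}
```

In a real PySpark job the same shape would be `rdd.reduceByKey(merge)` followed by a `mapValues` that divides by the count.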