I am trying to convert two numpy vectors (output from a pyspark.ml PCA) into a PySpark DataFrame and then write that DataFrame to my Hive environment, but it appears the DataFrame I create is somehow broken, and I don't understand why.
Below is a toy example that reproduces the error; the production version of this example succeeds in my Jupyter Notebook environment (PySpark 2.1) but fails when run from the command line on the production cluster (PySpark 2.2).
I could not find anything in the 2.1-to-2.2 upgrade documentation that explains why this might be a problem.
from pyspark.sql import SparkSession  # this import was missing from the original snippet
import numpy as np
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the two PCA output vectors
A = np.array(range(10))
B = np.array(list("ABCDEFGHIJ"))

# Build a pandas DataFrame, then convert it to a Spark DataFrame
pdDF = pd.DataFrame(B, columns=["B"], index=A)
sDF = spark.createDataFrame(pdDF)
So far, so good. Inspecting the inputs that went into sDF:
>>> A
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> B
array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
dtype='<U1')
>>> pdDF
B
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
The sDF schema looks fine to me:
>>> sDF.schema
StructType(List(StructField(B,StringType,true)))
But attempting to take two rows produces roughly 100 lines of error trace that I don't understand:
>>> sDF.take(2)
19/05/26 22:45:28 ERROR scheduler.TaskSetManager: Task 0 in stage 104.0 failed 4 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/pyspark/sql/dataframe.py", line 476, in take
return self.limit(num).collect()
File "/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/pyspark/sql/dataframe.py", line 438, in collect
sock_info = self._jdf.collectToPython()
File "/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/data/2/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1568.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 104.0 failed 4 times, most recent failure: Lost task 0.3 in stage 104.0 (TID 4267, anp-r01wn07.c03.hadoop.td.com, executor 74): java.io.IOException: Cannot run program "/usr/local/anaconda3/bin/python": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:169)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:95)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:69)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:132)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:380)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 29 more
Naturally, I expected to see the top rows of data instead. I get the same error message when I try to write the table to Hive.
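For reference, the Hive write that fails with the same trace is roughly the following (the database and table names here are placeholders, not my real ones):

# Hypothetical table name; any action that ships work to the executors
# (take, collect, saveAsTable, ...) fails with the same trace
sDF.write.mode("overwrite").saveAsTable("my_db.my_table")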
Answer 0 (score: 0)
Based on the error in this line:
java.io.IOException: Cannot run program "/usr/local/anaconda3/bin/python": error=2, No such file or directory
Take a look at this post and see whether it helps you.
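The executors are failing to launch a Python worker because /usr/local/anaconda3/bin/python does not exist on the worker nodes; that path is what the driver advertises to them (typically via the PYSPARK_PYTHON environment variable). This also explains why sDF.schema succeeds while take() fails: schema inspection runs entirely on the driver, and executor-side Python workers are only launched when an action actually runs. A minimal sketch of one common fix, assuming /usr/bin/python exists on every node in the cluster (substitute whatever interpreter path your worker nodes actually have):

import os

# Must be set before the SparkSession/SparkContext is created;
# the executor path must exist on every worker node.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python"

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Equivalently, export PYSPARK_PYTHON in the shell before calling spark-submit, or pass --conf spark.pyspark.python=/usr/bin/python at submit time.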