I am trying to convert a Spark DataFrame to a Pandas DataFrame, but I get an error.
Spark session configuration
{
    "conf":
    {
        "spark.pyspark.python": "python",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
    }
}
Code
import os

# Set the proxy only for the duration of the package install, then unset it.
os.environ['http_proxy'] = "http://10.10.10.1:443"
os.environ['https_proxy'] = "http://10.10.10.1:443"
sc.install_pypi_package("pandas")
del os.environ['http_proxy']
del os.environ['https_proxy']

some_df = sc.parallelize([
    ("A", "no"),
    ("B", "yes"),
    ("B", "yes"),
    ("B", "no")]
).toDF(["user_id", "phone_number"])

# Collect the Spark DataFrame to the driver as a Pandas DataFrame.
pandas_df = some_df.toPandas()
Before running the code above, I verified that the pandas package is visible with the following:
sc.list_packages()
However, when the code runs, Spark also appears to try to install Pandas on every executor, and that install fails.
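To tell apart "pandas is visible on the driver" from "pandas is visible on the executors", a small probe function can be mapped over a few partitions. This is only a rough sketch: `probe_pandas` is a name I made up for illustration, and in my setup the Spark job in the trailing comment would probably hit the same virtualenv setup error before the probe ever runs.

```python
import importlib.util
import socket


def probe_pandas(_):
    """Return (hostname, True/False) depending on whether pandas is importable here."""
    found = importlib.util.find_spec("pandas") is not None
    return (socket.gethostname(), found)


# Local check on the machine running this snippet (the driver, in a notebook).
print(probe_pandas(0))

# Hypothetical executor-side check, assuming the notebook-provided `sc` exists:
#   sc.parallelize(range(8), 8).map(probe_pandas).distinct().collect()
```

Each tuple in the result names a host and says whether pandas could be found there, without actually importing it.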
Error
Could not find a version that satisfies the requirement pandas (from versions: )
No matching distribution found for pandas
21/03/03 15:56:53 INFO VirtualEnvFactory: Start to setup virtualenv...
21/03/03 15:56:53 INFO VirtualEnvFactory: Running command:/usr/bin/virtualenv -p python --system-site-packages virtualenv_application_1609789675521_0076_0
21/03/03 15:56:53 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 4)
java.lang.RuntimeException: Failed to run command: virtualenv_application_1609789675521_0076_0/bin/python -m pip install pandas
at org.apache.spark.api.python.VirtualEnvFactory.org$apache$spark$api$python$VirtualEnvFactory$$execCommand(VirtualEnvFactory.scala:120)
at org.apache.spark.api.python.VirtualEnvFactory$$anonfun$setupVirtualEnv$6.apply(VirtualEnvFactory.scala:98)
at org.apache.spark.api.python.VirtualEnvFactory$$anonfun$setupVirtualEnv$6.apply(VirtualEnvFactory.scala:86)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.api.python.VirtualEnvFactory.setupVirtualEnv(VirtualEnvFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:85)
at org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:118)
at org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:118)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:79)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:109)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/03/03 15:56:53 INFO CoarseGrainedExecutorBackend: Got assigned task 5
21/03/03 15:56:53 INFO Executor: Running task 1.1 in stage 3.0 (TID 5)
Spark UI screenshot
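Update: My Spark machines have no internet access, and that is what causes the problem. Is there any way to configure the virtualenv to use a proxy?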
21/03/03 16:57:12 INFO VirtualEnvFactory: Running command:virtualenv_application_1609789675521_0077_0/bin/python -m pip install pandas
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f527ee470f0>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/pandas/
Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f527ee47748>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/pandas/
Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f527ee47710>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/pandas/
Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f527ee47f60>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/pandas/
Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f527ee47a20>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/pandas/
Could not find a version that satisfies the requirement pandas (from versions: )
No matching distribution found for pandas
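One thing I am considering, as an untested sketch: Spark's standard `spark.executorEnv.<NAME>` setting exports an environment variable into executor processes, and pip reads `http_proxy` / `https_proxy` from the environment. Whether the pip command launched by `VirtualEnvFactory` actually inherits these variables is an assumption I have not confirmed. The helper name `build_session_conf` is made up for illustration; the proxy URL is the one from my code above.

```python
import json


def build_session_conf(proxy_url):
    """Build the session configuration with proxy variables passed to executors."""
    conf = {
        "spark.pyspark.python": "python",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    }
    # spark.executorEnv.<NAME> sets the environment variable <NAME> on executors.
    # Assumption: the virtualenv pip subprocess inherits the executor environment.
    for name in ("http_proxy", "https_proxy"):
        conf["spark.executorEnv." + name] = proxy_url
    return {"conf": conf}


# Print the JSON that would replace the session configuration shown earlier.
print(json.dumps(build_session_conf("http://10.10.10.1:443"), indent=4))
```

The printed JSON has the same shape as my current session configuration, with two extra `spark.executorEnv.*` entries.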