NameError: name 'sc' is not defined

Date: 2019-12-25 18:07:57

Tags: python apache-spark pyspark data-science

Two days ago I was able to run basic PySpark operations. Now the Spark context sc is no longer available. I have tried suggestions from several blogs, but nothing has worked. I currently have Python 3.6.6, Java 1.8.0_231, and Apache Spark (with Hadoop) spark-3.0.0-preview-bin-hadoop2.7.

I am trying to run a simple command in a Jupyter notebook:

data = sc.textfile('airline.csv')
and I get the following error:
NameError                                 Traceback (most recent call last)
<ipython-input-2-572751a2bc2a> in <module>
----> 1 data = sc.textfile('airline.csv')

NameError: name 'sc' is not defined

I have set the following system variables:

HADOOP_HOME = C:\spark-3.0.0-preview-bin-hadoop2.7 
PYSPARK_DRIVER_PYTHON = ipython
PYSPARK_DRIVER_PYTHON_OPTS = notebook
SPARK_HOME = C:\spark-3.0.0-preview-bin-hadoop2.7
(Java and Python system variables are already set)
path = C:\spark-3.0.0-preview-bin-hadoop2.7\bin (I have placed winutils.exe in this folder)
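
With these set, a quick sanity check in a notebook cell (my own sketch, not part of the original question) is to confirm the kernel actually sees the same variables:

import os
# Print the paths the notebook kernel sees; they should match the values listed above.
print(os.environ.get('SPARK_HOME'))
print(os.environ.get('HADOOP_HOME'))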

Now, if I remove the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables and run pyspark from the command prompt, I get the following error:

C:\spark-3.0.0-preview-bin-hadoop2.7>pyspark
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
19/12/25 23:28:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/12/25 23:28:42 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
19/12/25 23:28:42 WARN Utils: Service 'sparkDriver' could not bind on a random free port. You may check whether configuring an appropriate binding address.
19/12/25 23:28:42

I have also tried to find a workaround for this but could not resolve it. Please help.

2 Answers:

Answer 0 (score: 0)

I don't know why, but this is how it worked. I am using my company's laptop. When I connect to the company network through Pulse Secure, my Spark context connects successfully; when I connect to my home network, it does not.

Strange, but that is how it worked for me.
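
If the difference really is which network adapter is active, one possible workaround (a sketch I am adding, not part of the original answer) is to pin the driver to the loopback address so the "Service 'sparkDriver' could not bind" warning no longer depends on the network:

from pyspark.sql import SparkSession

# Force the driver to bind to localhost; spark.driver.bindAddress and
# spark.driver.host are standard Spark configuration keys.
spark = (SparkSession.builder
         .config("spark.driver.bindAddress", "127.0.0.1")
         .config("spark.driver.host", "127.0.0.1")
         .getOrCreate())
sc = spark.sparkContext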

Answer 1 (score: 0)

@GiovaniSalazar is right. You need to import

from pyspark.sql import SQLContext, Row, SparkSession

and define sc:

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

Then refer to sc:

data = sc.textFile('airline.csv')

in your case (note that the RDD method is textFile with a capital F, not textfile).
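
As an aside (my own addition, not part of the original answer), since the file is a CSV, the DataFrame reader can be used the same way once the session exists:

# Assumes 'airline.csv' sits in the working directory; spark.read.csv is the
# standard CSV entry point in Spark 3.x.
df = spark.read.csv('airline.csv', header=True, inferSchema=True)
df.show(5)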