我试图在我的Windows 10机器上安装spark。我有一个python 2.7的anacondo2。我设法打开了ipython笔记本实例。我能够运行以下几行:
airlines=sc.textFile("airlines.csv")
print (airlines)
但是当我跑步时出现错误:airlines.first()
这是我得到的错误:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-6-85a5d6f5110f> in <module>()
----> 1 airlines.first()
C:\spark\python\pyspark\rdd.py in first(self)
1326 ValueError: RDD is empty
1327 """
-> 1328 rs = self.take(1)
1329 if rs:
1330 return rs[0]
C:\spark\python\pyspark\rdd.py in take(self, num)
1308
1309 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1310 res = self.context.runJob(self, takeUpToNumLeft, p)
1311
1312 items += res
C:\spark\python\pyspark\context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
932 mappedRDD = rdd.mapPartitions(partitionFunc)
933 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
--> 934 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
935
936 def show_profiles(self):
C:\spark\python\pyspark\rdd.py in _load_from_socket(port, serializer)
137 break
138 if not sock:
--> 139 raise Exception("could not open socket")
140 try:
141 rf = sock.makefile("rb", 65536)
Exception: could not open socket
执行时遇到不同的错误:airlines.collect()
这是错误:
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-5-3745b2fa985a> in <module>()
1 # Using the collect operation, you can view the full dataset
----> 2 airlines.collect()
C:\spark\python\pyspark\rdd.py in collect(self)
775 with SCCallSiteSync(self.context) as css:
776 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
--> 777 return list(_load_from_socket(port, self._jrdd_deserializer))
778
779 def reduce(self, f):
C:\spark\python\pyspark\rdd.py in _load_from_socket(port, serializer)
140 try:
141 rf = sock.makefile("rb", 65536)
--> 142 for item in serializer.load_stream(rf):
143 yield item
144 finally:
C:\spark\python\pyspark\serializers.py in load_stream(self, stream)
515 try:
516 while True:
--> 517 yield self.loads(stream)
518 except struct.error:
519 return
C:\spark\python\pyspark\serializers.py in loads(self, stream)
504
505 def loads(self, stream):
--> 506 length = read_int(stream)
507 if length == SpecialLengths.END_OF_DATA_SECTION:
508 raise EOFError
C:\spark\python\pyspark\serializers.py in read_int(stream)
541
542 def read_int(stream):
--> 543 length = stream.read(4)
544 if not length:
545 raise EOFError
C:\Users\AS\Anaconda2\lib\socket.pyc in read(self, size)
382 # fragmentation issues on many platforms.
383 try:
--> 384 data = self._sock.recv(left)
385 except error, e:
386 if e.args[0] == EINTR:
error: [Errno 10054] An existing connection was forcibly closed by the remote host
请帮忙。
答案 0 :(得分:0)
在Windows 10上安装PYSPARK 带有ANACONDA NAVIGATOR的JUPYTER-NOTEBOOK
下载软件包
1)spark-2.2.0-bin-hadoop2.7.tgz Download
2)java jdk 8版本Download
3)Anaconda v 5.2 Download
4)scala-2.12.6.msi Download
5)hadoop v2.7.1 Download
在 C:/ 中制作火花文件夹,驱动并放入其中的所有内容 It will look like this
注意:在安装SCALA时将SCALA放入火花文件夹内的路径
现在设置新的Windows环境变量
HADOOP_HOME=C:\spark\hadoop
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151
SCALA_HOME=C:\spark\scala\bin
SPARK_HOME=C:\spark\spark\bin
PYSPARK_PYTHON=C:\Users\user\Anaconda3\python.exe
PYSPARK_DRIVER_PYTHON=C:\Users\user\Anaconda3\Scripts\jupyter.exe
PYSPARK_DRIVER_PYTHON_OPTS=notebook
立即选择火花路径:编辑并添加新内容
在变量“ Path” Windows中添加“ C:\ spark \ spark \ bin ”
就这样,您的浏览器将使用Juypter localhost弹出
检查pyspark是否正常工作!
输入简单代码并运行
from pyspark.sql import Row
a = Row(name = 'Vinay' , age=22 , height=165)
print("a: ",a)