I am new to Python and Spark, and here I am trying to play with a Spark R-tree index. When I try to broadcast the index and use it inside a mapPartitions function, it fails with the errors below.
On Windows:

  File "avlClass.py", line 42, in avlFileLine
    for j in bv.intersection([x_meters-buffer_size,y_meters-buffer_size,x_meters+buffer_size,y_meters+buffer_size]):
  File "C:\Python27\ArcGIS10.3\lib\site-packages\rtree\index.py", line 440, in intersection
    p_mins, p_maxs = self.get_coordinate_pointers(coordinates)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\rtree\index.py", line 294, in get_coordinate_pointers
    dimension = self.properties.dimension
  File "C:\Python27\ArcGIS10.3\lib\site-packages\rtree\index.py", line 883, in get_dimension
    return core.rt.IndexProperty_GetDimension(self.handle)
WindowsError: exception: access violation reading 0x00000004
  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
On Linux:
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
File: avlClass.py
import fiona
from shapely.geometry import shape, Point, LineString, Polygon
from shapely.ops import transform
from rtree import index
from numpy import math
import os
import pyproj
from functools import partial
from pyspark import SparkContext, SparkConf

class avlClass(object):
    def __init__(self, name):
        self.name = name

    def create_index(self):
        # Read the ESRI Shape File
        shapeFileName = 'C:\\shapefiles\\Road.shp'
        polygons = [pol for pol in fiona.open(shapeFileName, 'r')]
        p = index.Property()
        p.dimension = 2
        self_idx = index.Index(property=p)
        # Create Index Entries
        for pos, features in enumerate(polygons):
            self_idx.insert(pos, LineString(features['geometry']['coordinates']).bounds)
        return self_idx

    def avlFileLine(self, iter, bv):
        for line in iter:
            splits = line.split(',')
            lat = float(splits[2])
            long = float(splits[3])
            print lat, long
            x = 'No'
            # Test the index from broadcast Variable bv
            buffer_size = 10
            x_meters = -9511983.32151
            y_meters = 4554613.80307
            for j in bv.intersection([x_meters - buffer_size, y_meters - buffer_size, x_meters + buffer_size, y_meters + buffer_size]):
                x = "FOUND"
            yield lat, long, heading_radians, x
File: avlSpark.py
import fiona
from shapely.geometry import shape, Point, LineString, Polygon
from shapely.ops import transform
from rtree import index
from numpy import math
import os
import pyproj
from functools import partial
from pyspark import SparkContext, SparkConf
from avlClass import avlClass

if __name__ == '__main__':
    conf = SparkConf().setAppName('AVL_Spark_Job')
    conf = SparkConf().setMaster('local[*]')
    sc = SparkContext(conf=conf)
    sc.addPyFile("avlClass.py")

    test_avlClass = avlClass("Test")
    print test_avlClass.name

    idx = test_avlClass.create_index()

    # Test the created index
    buffer_size = 10
    x_meters = -9511983.32151
    y_meters = 4554613.80307
    for j in idx.intersection([x_meters - buffer_size, y_meters - buffer_size, x_meters + buffer_size, y_meters + buffer_size]):
        print "FOUND"  # Index Worked

    # broadcast Index for Partitions
    idx2 = sc.broadcast(idx)

    FileName = 'c:\\test\\file1.txt'
    avlFile = sc.textFile(FileName).mapPartitions(lambda line: test_avlClass.avlFileLine(line, idx2.value))

    for line in avlFile.take(10):
        print line
Answer 0 (score: 0)
What I see is that you are creating a broadcast variable:
# broadcast Index for Partitions
idx2=sc.broadcast(idx)
and then passing its .value to avlFileLine:
avlFile=sc.textFile(FileName).mapPartitions(lambda line: test_avlClass.avlFileLine(line,idx2.value))
But neither idx nor idx2 is an RDD. idx2, being a broadcast variable, will take on whatever class idx is. (I actually asked this question based on your question :)
You are still treating the passed parameter as a broadcast variable and then trying to use it as an RDD, presumably a PythonRDD, which as noted it is not. A broadcast variable is not an RDD; it is simply whatever type you assign to it. On top of that, you are passing its value (using .value) into avlFileLine.
So when you call intersection() on it, it blows up. I'm surprised it doesn't fail more gracefully, but I work in Java, where the compiler would catch this; I assume the Python interpreter just runs along happily until it hits a bad memory location, and you get that ugly error message :)
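As a quick illustration of the point that a broadcast variable simply wraps whatever object you hand it (a minimal sketch, not from the original code; it assumes a running SparkContext named sc):

# A broadcast variable is not an RDD: .value returns the original object unchanged.
nums = [1, 2, 3]
bv = sc.broadcast(nums)
print type(bv)          # <class 'pyspark.broadcast.Broadcast'>
print type(bv.value)    # <type 'list'> -- same type as the object that was broadcast
print bv.value == nums  # True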
I think the best approach is to rethink your code from the start; it just isn't using Spark correctly. I don't know your specific application well, so my best guess is that you need to drop intersection() and instead look again at the RDD programming guide part of the Spark docs for Python. Find a way to apply the value of idx2 to avlFile, which is an RDD. You want to avoid any for loops inside the passed function; Spark does the "for" loop for you by applying whatever function you pass to each element of the RDD. Keep in mind that the result will be another RDD.
In pseudo-Java code, it would look something like:
SomeArray theArray = avlfile.map({declare inline or call function}).collect(<if the RDD is not too big for collect>)
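A rough Python sketch of the same idea (illustrative only; the parse function and field positions are assumptions, and the file path is the one from the question):

# Sketch: let Spark do the "for" loop by mapping a function over every line,
# then collect the (small) result back to the driver.
def parse_line(line):
    splits = line.split(',')  # assumed CSV layout
    return (float(splits[2]), float(splits[3]))

avlfile = sc.textFile('c:\\test\\file1.txt')
theArray = avlfile.map(parse_line).collect()  # only if the RDD is small enough to collect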
If you haven't already, a great book is Learning Spark by O'Reilly, along with the sample code for the book; it is a good next step after the Apache Spark Docs. Learning Spark can be rented for under $10, and in my case I got it for free through Safari Books as a university student.
If you're not used to thinking in terms of functional programming, writing Spark programs has a steep learning curve. I'm not that far along it myself, and IMHO you haven't fully absorbed the Spark programming model yet. I hope all of this helps.
Also, as noted in the original edit of this answer, your calls to SparkConf are wrong. I had to go back a ways in the docs (0.9) to find an example, but you want something like this:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
from the Standalone Programs section of the docs. As your code stands, I believe your second assignment to conf overwrites the first.
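Applied to your script, that means chaining the calls on one SparkConf rather than assigning conf twice (a sketch using the app name and master from your own code):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('AVL_Spark_Job')
        .setMaster('local[*]'))
sc = SparkContext(conf=conf)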
To summarize: I don't see how you can call RDD functionality on a broadcast variable across all the workers; a broadcast variable is NOT an RDD, but simply a data structure, like a global, that is read (and not written) by all the workers. Per the Broadcast class in Scala:
From the docs on Broadcast Variables:
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> broadcastVar.value
[1, 2, 3]
It just doesn't seem reasonable to me to call intersection() on bv (which is not an RDD).
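For contrast, a minimal sketch of how a broadcast variable is normally used: as read-only lookup data referenced inside the function you pass to Spark (the lookup table, field layout, and names below are illustrative assumptions, not from the original code):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("broadcast_demo")
sc = SparkContext(conf=conf)

# Small, read-only lookup data shipped once to every worker.
road_names = sc.broadcast({1: "Main St", 2: "Oak Ave"})

def label_record(line):
    # Illustrative: assume each input line is "road_id,value".
    road_id, value = line.split(',')
    # Read the broadcast value on the worker; never write to it.
    return (road_names.value.get(int(road_id), "UNKNOWN"), value)

records = sc.parallelize(["1,42", "2,17", "3,99"])
print records.map(label_record).collect()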