I'm running into an intermittent PySpark problem. When I collect() a particular RDD, it returns all the values, but when I try to map over it, it returns []. The strangest part is that maybe one time in five it returns the values of the same RDD correctly, in the same session, without any changes. That sounds impossible, and it probably is…
The key point:
>>> pairs.collect()
[('b', ('d', 3)), ('c', ('d', 2)), ('g', ('d', -2)), ('b', ('z', 1)),
('a', ('b', 4)), ('a', ('c', 3)), ('b', ('c', 2)), ('b', ('f', -1)),
('a', ('g', -2)), ('c', ('z', -4))]
>>> pairs.map(lambda x: x).collect()
[]
>>> pairs.flatMap(lambda x: x).collect()
[]
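For anyone trying to reproduce this: RDD transformations are lazy, so every `collect()` re-runs the whole lineage, and any driver-side state the lambdas close over is read at evaluation time, not at definition time. Here is a minimal plain-Python sketch of the same pitfall (no Spark involved; the names are illustrative):

```python
# A lazy pipeline that closes over a mutable driver-side list,
# mimicking an RDD filter whose lambda reads broadcast state
# at evaluation time rather than at definition time.
seen = ['a']

def lazy_filter(data):
    # Like an RDD transformation: nothing runs until we iterate ("collect").
    return (x for x in data if x[0] in seen)

data = [('a', 1), ('b', 2)]
pipeline = lazy_filter(data)

seen.clear()            # driver-side state changes after the "transformation"
print(list(pipeline))   # [] -- the filter sees the *current* state, so nothing matches
```

If the state the filter depends on is mutated between defining the pipeline and evaluating it, the second evaluation can legitimately come back empty even though an earlier one returned values.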
The problem seems to come from the fourth line of the while loop, `new_uvs = pairs.flatMap(lambda x: (x[0], x[1][0])).collect()`, since that line returns nothing. Here is my full code:
sc = spark.sparkContext
l = [("d", ("e",1)),("b", ("d",3)),("c", ("d",2)),("g", ("d",-2)),
("b", ("z",1)),("a", ("b",4)),("a", ("c",3)),("b", ("c",2)),("b",
("f",-1)),("a",
("g",-2)),
("a", ("f",3)),("c", ("z",-4)), ("x", ("y",0))]
network = sc.parallelize(l)
#source and destinations
src = sc.broadcast('a')
dest = sc.broadcast('e')
#collect all u's and v's in the node path thus far into rdd
pairs = network.filter(lambda x: x[0] == dest.value or x[1][0] == dest.value)
#store pairs in a list so you can add new pairs
path_pool_static = pairs.collect()
path_pool = sc.broadcast(path_pool_static)
#collect all pairs with src in it
uvs = sc.broadcast(pairs.flatMap(lambda x: (x[0], x[1][0])).collect())
network = network.filter(lambda x: x not in path_pool.value)
while network.filter(lambda x: x[0] == src.value or x[1][0] == src.value).collect() != []:
    pairs = network.filter(lambda x: x[0] in uvs.value or x[1][0] in uvs.value)
    # initialize uvs_static as the uvs list, add the new pairs' u's and v's
    # to the list, and broadcast it
    uvs_static = uvs.value
    new_uvs = pairs.flatMap(lambda x: (x[0], x[1][0])).collect()
    uvs_static.extend(new_uvs)
    uvs = sc.broadcast(uvs_static)
    # update path_pool_static with the new pairs and broadcast it
    path_pool_static.extend(pairs.collect())
    path_pool = sc.broadcast(path_pool_static)
    # remove pairs already in the path
    network = network.filter(lambda x: x not in path_pool.value)
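One thing I have tried (a sketch of the workaround idea, not necessarily the fix): since `pairs` is lazy, its lambdas re-read `uvs.value` and `path_pool.value` every time an action runs, so I can take a concrete snapshot before mutating any of that state. In plain Python the difference looks like this:

```python
# Materializing the filtered result immediately, analogous to calling
# collect() (or cache()) on the RDD before any driver-side state changes.
seen = ['a']
data = [('a', 1), ('b', 2)]

# Snapshot taken eagerly: a concrete list, not a lazy pipeline.
pairs_snapshot = [x for x in data if x[0] in seen]

seen.clear()                 # later mutations no longer affect the snapshot
print(pairs_snapshot)        # [('a', 1)]
```

In Spark terms that would mean computing `pairs_static = pairs.collect()` once per iteration (or calling `pairs.cache()` and forcing it with an action) before `uvs`, `uvs_static`, or `path_pool` are rebound or extended, so later actions on `pairs` are not re-evaluated against the updated state.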