I am attempting to implement a naive version of PageRank in Apache Spark (1.4.0) using Python. The details of the algorithm (how it works) can be found here (look about a third of the way down the page, at the matrix H and the stationary vector I).
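For reference, the per-vertex update I am trying to compute (with no damping factor) is:

new_rank(v) = sum over all edges u -> v of rank(u) / outdeg(u)

that is, each vertex splits its current PageRank evenly among its out-neighbors, which amounts to one application of the matrix H from the linked notes.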
PageRank is iterative: at each step, every vertex emits a share of its current PageRank to each of its neighbors, and then a reduce collects the PageRank sent to each vertex. This results in a loop in which an RDD is repeatedly iterated over and updated (RDDs are read-only, so in reality each pass creates a new RDD). In principle, one should be able to use .cache() so that each intermediate RDD is kept in memory.
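Schematically, the loop I have in mind looks like this (a minimal sketch; initial_rank_rdd, num_iterations, and step() are placeholders, with step() standing for the join/flatMap/reduceByKey pipeline in the real code below):

rank = initial_rank_rdd.cache()
for i in xrange(num_iterations):
    # Each pass derives a NEW RDD from the previous one; the .cache() is meant to
    # keep the previous result in memory so later passes do not recompute it.
    rank = step(rank).cache()
print rank.collect()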
The problem I am running into is that, in my loop, the RDD is recomputed on every iteration of the loop, even when I use RDD.cache(). I know that the RDD fits in memory (the input file I am using is very small, and the RDD has only 8 elements). My code is below:
from pyspark import SparkContext, SparkConf
import sys

def my_map(line):
    # The structure of line is (vertex, ([list of outgoing edges], current_PageRank)).
    out_edges = line[1][0]
    current_PageRank = line[1][1]
    e = len(out_edges)
    if e > 0:
        return_list = []
        for f in out_edges:
            return_list.append((f, current_PageRank / float(e)))
        return return_list
    return []  # a vertex with no outgoing edges emits nothing (flatMap needs an iterable, not None)

conf = SparkConf().setAppName("PageRank")
sc = SparkContext(conf=conf)
fileName = sys.argv[1]

# lines is an RDD where each element is one line of the text file, as a string.
lines = sc.textFile(fileName)

# edge_list is an RDD where each element is built from the integers on one line of the file.
# It is cached because we will refer to it numerous times throughout the computation.
# Each element of edge_list is of the form (vertex, [out-neighbors]), so (int, list).
edge_list = lines.map(lambda line: (int(line.split()[0]), [int(x) for x in line.split()[1:]])).cache()

# vertex_set is an RDD that is the list of all vertices.
vertex_set = edge_list.map(lambda row: row[0])

# N is the number of vertices in the graph.
N = vertex_set.count()

# Initialize the PageRank vector.
# Each vertex is keyed with its initial value (1/N, where N is the number of vertices).
# Elements of Last_PageRank have the form (vertex, PageRank), so (int, float).
Last_PageRank = vertex_set.map(lambda x: (x, 1.0 / N)).cache()

for number in xrange(40):
    Last_PageRank = edge_list.join(Last_PageRank).flatMap(my_map).reduceByKey(lambda a, b: a + b).cache()
    ### In version 2, I comment the previous and the last line out, and un-comment the following 3 lines.
    #LList = edge_list.join(Last_PageRank).flatMap(my_map).reduceByKey(lambda a, b: a + b).collect()
    #print LList
    #Last_PageRank = sc.parallelize(LList)

print Last_PageRank.collect()
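For concreteness, the input file is a whitespace-separated adjacency list, one vertex per line, with the vertex first and its out-neighbors after it. A made-up example (not my actual data):

1 2 3
2 3
3 1

which edge_list parses into [(1, [2, 3]), (2, [3]), (3, [1])].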
To illustrate why I believe caching is not working, I timed the code above with the loop running for 5, 10, 15, ..., 40 iterations. Then I changed the code so that, at each step, I collect the RDD with .collect(), print it to the screen, and redistribute the list as an RDD using sc.parallelize(). When I do this, the computation is significantly faster. The timing data (without the collect/parallelize workaround) is as follows:
Num Iterations  Time(s)
5 14.356s
10 27.783s
15 47.983s
20 75.019s
25 108.298s
30 148.345s
35 195.525s
40 248.699s
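If the cache were actually being hit, I would expect the time per iteration to stay roughly constant, but it grows steadily. A quick sanity check on the numbers above (plain Python arithmetic on the table):

times = {5: 14.356, 10: 27.783, 15: 47.983, 20: 75.019, 25: 108.298, 30: 148.345, 35: 195.525, 40: 248.699}
for n in sorted(times):
    print n, round(times[n] / n, 2)
# seconds per iteration: 2.87, 2.78, 3.2, 3.75, 4.33, 4.94, 5.59, 6.22 (steadily increasing)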
By comparison, when I use the collect/parallelize workaround, the 40-iteration version takes only 43.922 seconds. I would expect that, if caching worked the way I think it does, the original version should take (at most) about 43.9 seconds as well.
Any help is appreciated. By the way, I would appreciate it if any explanation could be kept fairly basic (I am a self-taught programmer, a mathematician by education).