How to force caching in Apache-Spark with Python

Date: 2015-09-24 14:39:18

Tags: python apache-spark pagerank

I am trying to implement a naive version of PageRank in Apache-Spark (1.4.0) with Python. The details of the algorithm (the way it works) can be found here (look about a third of the way down the page, at the matrix H and the stationary vector I).

PageRank is iterative (at each step, every vertex emits a share of its current page rank to its neighbors, and a reduce function then collects the page rank sent to each vertex). This leads to a loop in which an RDD is repeatedly traversed and updated (RDDs are read-only, so in practice new RDDs get created). In principle, one should be able to use .cache() to keep the intermediate RDD in memory from one iteration to the next.
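
In symbols, each pass of the loop computes the update below (this is my reading of the construction linked above; the code further down has no damping/teleportation term, and outdeg(u) denotes the number of out-edges of u):

    r_{k+1}(v) = \sum_{u \to v} \frac{r_k(u)}{\mathrm{outdeg}(u)}, \qquad r_0(v) = \frac{1}{N},

where N is the number of vertices. The map step emits the terms r_k(u)/outdeg(u) and the reduce step sums them per destination vertex.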

The problem I am running into is that, inside my loop, the RDD is recomputed on every iteration of the loop even though I use RDD.cache(). I know the RDD fits in memory (the input file I am using is very small, so the RDD has only 8 elements). My code is below:

    from pyspark import SparkContext, SparkConf
    import sys

    def my_map(line):
        # The structure of line is (vertex, ([list of outgoing edges], current_PageRank))
        out_edges = line[1][0]
        current_PageRank = line[1][1]
        e = len(out_edges)
        if e > 0:
            return_list = []
            for f in out_edges:
                return_list.append( (f, current_PageRank/float(e)) )
            return return_list

    conf = SparkConf().setAppName("PageRank")
    sc = SparkContext(conf=conf)

    fileName = sys.argv[1]

    # lines is an RDD (list) where each element of the RDD is a string (one line of the text file).
    lines = sc.textFile(fileName)

    # edge_list is an RDD where each element is the list of integers from a line of the text file.
    # edge_list is cached because we will refer to it numerous times throughout the computation.
    # Each element of edge_list is of the form (vertex, [out neighbors]), so (int, list).
    edge_list = lines.map(lambda line: (int(line.split()[0]), [int(x) for x in line.split()[1:]]) ).cache()

    # vertex_set is an RDD that is the list of all vertices.
    vertex_set = edge_list.map(lambda row: row[0])

    # N is the number of vertices in the graph.
    N = vertex_set.count()

    # Initialize the PageRank vector.
    # Each vertex will be keyed with its initial value (1/N where N is the number of vertices).
    # Elements of Last_PageRank have the form (vertex, PageRank), so (int, float).
    Last_PageRank = vertex_set.map(lambda x: (x, 1.0/N) ).cache()

    for number in xrange(40):
        Last_PageRank = edge_list.join(Last_PageRank).flatMap(my_map).reduceByKey(lambda a, b: a+b).cache()

        ### In version 2, I comment the previous and last line out, and un-comment the following 3 lines.
        #LList = edge_list.join(Last_PageRank).flatMap(my_map).reduceByKey(lambda a, b: a+b).collect()
        #print LList
        #Last_PageRank = sc.parallelize(LList)

    print Last_PageRank.collect()

To illustrate why I think the caching is not working, I timed the code above with the number of loop iterations set to 5, 10, 15, ..., 40. Then I changed the code so that, at each step, I collect the RDD with .collect(), print it to the screen, and then redistribute the list as an RDD with sc.parallelize(). When I do this, the computation is dramatically faster. The timing data (without the sc.parallelize() workaround) is as follows:

    Num Iterations    Time (s)
     5                 14.356
    10                 27.783
    15                 47.983
    20                 75.019
    25                108.298
    30                148.345
    35                195.525
    40                248.699

By comparison, when I use the .collect() / sc.parallelize() workaround, the 40-iteration version takes only 43.922 seconds. I would have expected that, if the caching worked the way I think it should, the original version would take (at most) about 43.9 seconds.
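
For clarity, in "version 2" the loop looks like this (the three commented-out lines above enabled in place of the .cache() line, with the final print Last_PageRank.collect() also commented out, since the vector is already printed at every iteration):

    for number in xrange(40):
        # Pull the new PageRank vector back to the driver, print it,
        # and redistribute the plain Python list as a fresh RDD.
        LList = edge_list.join(Last_PageRank).flatMap(my_map).reduceByKey(lambda a, b: a+b).collect()
        print LList
        Last_PageRank = sc.parallelize(LList)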

Any help is appreciated. Incidentally, I would be grateful for an explanation in whatever terms (I am a self-taught programmer, a mathematician by education).

0 Answers:

No answers yet.