I am attempting to implement a naive version of PageRank in Apache Spark (1.4.0) using Python. The details of the algorithm (how it works) can be found here (look about a third of the way down the page, at the matrix H and the stationary vector I).
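For reference, the per-vertex update I am trying to compute (with no damping factor) is:

new_rank(v) = sum over all edges u -> v of rank(u) / outdeg(u)

that is, each vertex splits its current PageRank evenly among its out-neighbors, which amounts to one application of the matrix H from the linked notes.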
PageRank is iterative: at each step, every vertex emits a share of its current PageRank to each of its neighbors, and then a reduce collects the PageRank sent to each vertex. This results in a loop in which an RDD is repeatedly iterated over and updated (RDDs are read-only, so in reality each pass creates a new RDD). In principle, one should be able to use .cache() so that each intermediate RDD is kept in memory.
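Schematically, the loop I have in mind looks like this (a minimal sketch; initial_rank_rdd, num_iterations, and step() are placeholders, with step() standing for the join/flatMap/reduceByKey pipeline in the real code below):

rank = initial_rank_rdd.cache()
for i in xrange(num_iterations):
    # Each pass derives a NEW RDD from the previous one; the .cache() is meant to
    # keep the previous result in memory so later passes do not recompute it.
    rank = step(rank).cache()
print rank.collect()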
The problem I am running into is that, in my loop, the RDD is recomputed on every iteration of the loop, even when I use RDD.cache(). I know that the RDD fits in memory (the input file I am using is very small, and the RDD has only 8 elements). My code is below:
from pyspark import SparkContext, SparkConf
import sys

def my_map(line):
    # The structure of line is (vertex, ([list of outgoing edges], current_PageRank)).
    out_edges = line[1][0]
    current_PageRank = line[1][1]
    e = len(out_edges)
    if e > 0:
        return_list = []
        for f in out_edges:
            return_list.append((f, current_PageRank / float(e)))
        return return_list
    return []  # a vertex with no outgoing edges emits nothing (flatMap needs an iterable, not None)

conf = SparkConf().setAppName("PageRank")
sc = SparkContext(conf=conf)
fileName = sys.argv[1]

# lines is an RDD where each element is one line of the text file, as a string.
lines = sc.textFile(fileName)

# edge_list is an RDD where each element is built from the integers on one line of the file.
# It is cached because we will refer to it numerous times throughout the computation.
# Each element of edge_list is of the form (vertex, [out-neighbors]), so (int, list).
edge_list = lines.map(lambda line: (int(line.split()[0]), [int(x) for x in line.split()[1:]])).cache()

# vertex_set is an RDD that is the list of all vertices.
vertex_set = edge_list.map(lambda row: row[0])

# N is the number of vertices in the graph.
N = vertex_set.count()

# Initialize the PageRank vector.
# Each vertex is keyed with its initial value (1/N, where N is the number of vertices).
# Elements of Last_PageRank have the form (vertex, PageRank), so (int, float).
Last_PageRank = vertex_set.map(lambda x: (x, 1.0 / N)).cache()

for number in xrange(40):
    Last_PageRank = edge_list.join(Last_PageRank).flatMap(my_map).reduceByKey(lambda a, b: a + b).cache()
    ### In version 2, I comment the previous and the last line out, and un-comment the following 3 lines.
    #LList = edge_list.join(Last_PageRank).flatMap(my_map).reduceByKey(lambda a, b: a + b).collect()
    #print LList
    #Last_PageRank = sc.parallelize(LList)

print Last_PageRank.collect()
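For concreteness, the input file is a whitespace-separated adjacency list, one vertex per line, with the vertex first and its out-neighbors after it. A made-up example (not my actual data):

1 2 3
2 3
3 1

which edge_list parses into [(1, [2, 3]), (2, [3]), (3, [1])].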
To illustrate why I believe caching is not working, I timed the code above with the loop running for 5, 10, 15, ..., 40 iterations. Then I changed the code so that, at each step, I collect the RDD with .collect(), print it to the screen, and redistribute the list as an RDD using sc.parallelize(). When I do this, the computation is significantly faster. The timing data (without the collect/parallelize workaround) is as follows:
Num Iterations  Time(s)
5 14.356s
10 27.783s
15 47.983s
20 75.019s
25 108.298s
30 148.345s
35 195.525s
40 248.699s
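If the cache were actually being hit, I would expect the time per iteration to stay roughly constant, but it grows steadily. A quick sanity check on the numbers above (plain Python arithmetic on the table):

times = {5: 14.356, 10: 27.783, 15: 47.983, 20: 75.019, 25: 108.298, 30: 148.345, 35: 195.525, 40: 248.699}
for n in sorted(times):
    print n, round(times[n] / n, 2)
# seconds per iteration: 2.87, 2.78, 3.2, 3.75, 4.33, 4.94, 5.59, 6.22 (steadily increasing)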
By comparison, when I use the collect/parallelize workaround, the 40-iteration version takes only 43.922 seconds. I would expect that, if caching worked the way I think it does, the original version should take (at most) about 43.9 seconds as well.
Any help is appreciated. By the way, I would appreciate it if any explanation could be kept fairly basic (I am a self-taught programmer, a mathematician by education).