Question

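I am new to pyspark and am trying to understand how PageRank works. I am using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) are at the following links: verticesRDD and edgesRDD

My code so far: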

#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *

#Read the csv files 
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")

#Renaming the id columns to enable GraphFrame 
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")

#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether or not I register the RDDs as temp tables, I get the same results, so I'm not sure this step is really needed

#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)

Now I run the pageRank function:

g.pageRank(resetProbability=0.15, maxIter=10)

Py4JJavaError: An error occurred while calling o98.run.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

results = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="id")

Py4JJavaError: An error occurred while calling o166.run.: org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Date: string, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])

ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()

AttributeError: 'function' object has no attribute 'resetProbability'

ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()

Py4JJavaError: An error occurred while calling o188.run.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

I am reading up on PageRank but don't understand where I'm going wrong... any help would be much appreciated.

Answer 1

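The problem was how I defined my vertices. I was renaming "station_ID" to "id" when in fact the column to rename had to be "name", because the edges' src and dst columns hold station names. So this line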

verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")

should be

verticesRDD = verticesRDD.withColumnRenamed("name", "id")

PageRank works correctly with this change!
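
For anyone hitting the same thing: a GraphFrame expects every src and dst value in the edges to appear in the vertices' id column. In this data the edges carry station names (see the row in the MatchError), while station_ID is an integer, so nothing matched, which is most likely where the null, null in the error came from. Here is a minimal sketch of that check in plain Python, without Spark. The sample rows are made up to resemble the data above, and find_unmatched_endpoints is a hypothetical helper name, not part of GraphFrames.

```python
def find_unmatched_endpoints(vertex_ids, edges):
    """Return the src/dst values in `edges` that are missing from `vertex_ids`."""
    known = set(vertex_ids)
    missing = set()
    for src, dst in edges:
        if src not in known:
            missing.add(src)
        if dst not in known:
            missing.add(dst)
    return missing


# Illustrative sample rows only (assumed shape, not the real CSV contents).
stations = [
    {"station_ID": 50, "name": "Harry Bridges Plaza (Ferry Building)"},
    {"station_ID": 70, "name": "San Francisco Caltrain (Townsend at 4th)"},
]
trips = [
    ("Harry Bridges Plaza (Ferry Building)",
     "San Francisco Caltrain (Townsend at 4th)"),
]

# station_ID as the vertex id: the station names never match an integer id.
by_station_id = find_unmatched_endpoints(
    [s["station_ID"] for s in stations], trips)

# name as the vertex id: every edge endpoint is found among the vertices.
by_name = find_unmatched_endpoints([s["name"] for s in stations], trips)

print(sorted(by_station_id))
print(sorted(by_name))
```

If the first list is non-empty for your own data, the vertex id column and the edge src/dst columns are not using the same kind of key, and one side needs to be renamed or mapped before building the GraphFrame.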
