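I'm new to pyspark and trying to understand how PageRank works. I'm using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) are at the following links: verticesRDD and edgesRDD
My code so far: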
#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *
#Read the csv files
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")
#Renaming the id columns to enable GraphFrame
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")
#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether or not I register the DataFrames as temp tables, I get the same results, so I'm not sure this step is really needed
#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)
Now I run the pageRank function:
g.pageRank(resetProbability=0.15, maxIter=10)
Py4JJavaError: An error occurred while calling o98.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
results = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="id")
Py4JJavaError: An error occurred while calling o166.run. : org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Date: string, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])
ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
AttributeError: 'function' object has no attribute 'resetProbability'
ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()
Py4JJavaError: An error occurred while calling o188.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
I've been reading up on PageRank, but I can't see where I'm going wrong. Any help would be appreciated.
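Answer 0: (score: 0)
The problem was how I defined my vertices. I was renaming "station_ID" to "id", when it actually had to be "name", because the src and dst columns of the edges hold station names (strings) rather than numeric station IDs. So this line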
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
has to be
verticesRDD = verticesRDD.withColumnRenamed("name", "id")
PageRank works fine with this change!
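To illustrate why the rename matters, here is a minimal plain-Python sketch (no Spark involved) of the consistency check behind the fix: every edge endpoint has to appear among the vertex ids. The helper name `find_dangling_edges` and the sample values are hypothetical, loosely modelled on the single trip row shown in the error message; the numeric ids 50 and 70 are assumed to correspond to that row's terminal numbers. On real DataFrames the same idea would be expressed as a join or set difference between the edge endpoints and the vertex ids.

```python
# Plain-Python sketch of the vertex/edge consistency check (not GraphFrames code).

def find_dangling_edges(vertex_ids, edges):
    """Return the (src, dst) edges whose src or dst is not a known vertex id."""
    known = set(vertex_ids)
    return [(src, dst) for src, dst in edges if src not in known or dst not in known]

# Hypothetical sample data modelled on the trip row in the error message.
station_ids = [50, 70]  # assumed numeric station_ID values
station_names = [
    "Harry Bridges Plaza (Ferry Building)",
    "San Francisco Caltrain (Townsend at 4th)",
]
trips = [
    ("Harry Bridges Plaza (Ferry Building)",
     "San Francisco Caltrain (Townsend at 4th)"),
]

# Vertices keyed by numeric station id: the trip's endpoints match nothing.
dangling_with_ids = find_dangling_edges(station_ids, trips)

# Vertices keyed by station name: every endpoint is found.
dangling_with_names = find_dangling_edges(station_names, trips)

print(len(dangling_with_ids), len(dangling_with_names))
```

With numeric ids as the vertex key, the sample trip is reported as dangling; with names as the key, nothing is. That mismatch is presumably what surfaced as the scala.MatchError rows containing nulls, although the sketch only demonstrates the mismatch itself, not Spark's internal behaviour.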