I recently started using Spark. Currently I am testing a bipartite graph with different vertex and edge types.
From the research I have done on GraphX, it seems I need to subclass a common property type to get edges of different kinds, some of them carrying attributes.
Here is a code snippet:
scala> trait VertexProperty
defined trait VertexProperty
scala> case class paperProperty(val paperid: Long, val papername: String, val doi: String, val keywords: String) extends VertexProperty
defined class paperProperty
scala> case class authorProperty(val authorid: Long, val authorname: String) extends VertexProperty
defined class authorProperty
scala> val docsVertces: RDD[(VertexId, VertexProperty)] = docs.rdd.map(x => (x(0).asInstanceOf[VertexId],paperProperty(x(0).asInstanceOf[VertexId],x(1).asInstanceOf[String],x(2).asInstanceOf[String],x(3).asInstanceOf[String])))
docsVertces: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, VertexProperty)] = MapPartitionsRDD[23] at map at <console>:47
scala> val authorVertces: RDD[(VertexId, VertexProperty)] = authors.rdd.map(x => (x(0).asInstanceOf[VertexId],authorProperty(x(0).asInstanceOf[Long],x(1).asInstanceOf[String])))
authorVertces: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, VertexProperty)] = MapPartitionsRDD[24] at map at <console>:41
scala> val vertices = VertexRDD(docsVertces ++ authorVertces)
vertices: org.apache.spark.graphx.VertexRDD[VertexProperty] = VertexRDDImpl[28] at RDD at VertexRDD.scala:57
scala>
With the edges, however, I fail.
scala> class EdgeProperty()
defined class EdgeProperty
scala> case class authorEdgeProperty( val doccount: Long) extends EdgeProperty()
defined class authorEdgeProperty
scala> case class citeEdgeProperty() extends EdgeProperty()
defined class citeEdgeProperty
scala> // edge using subclass will not work we need to have one consistent superclass
scala> val docauthoredges = docauthor.map(x => Edge(x(0).asInstanceOf[VertexId],x(1).asInstanceOf[VertexId], authorEdgeProperty(x(1).asInstanceOf[Long])))
docauthoredges: org.apache.spark.sql.Dataset[org.apache.spark.graphx.Edge[authorEdgeProperty]] = [srcId: bigint, dstId: bigint ... 1 more field]
scala> val docciteedges = doccites.map(x => Edge(x(0).asInstanceOf[VertexId],x(1).asInstanceOf[VertexId], citeEdgeProperty()))
docciteedges: org.apache.spark.sql.Dataset[org.apache.spark.graphx.Edge[citeEdgeProperty]] = [srcId: bigint, dstId: bigint ... 1 more field]
scala> docauthoredges.unionAll(docciteedges)
<console>:52: error: type mismatch;
found : org.apache.spark.sql.Dataset[org.apache.spark.graphx.Edge[citeEdgeProperty]]
required: org.apache.spark.sql.Dataset[org.apache.spark.graphx.Edge[authorEdgeProperty]]
docauthoredges.unionAll(docciteedges)
^
scala>
I tried to cast the edges to the superclass and received the following message:
scala> val docauthoredges = docauthor.map(x => Edge(x(0).asInstanceOf[VertexId],x(1).asInstanceOf[VertexId], authorEdgeProperty(x(1).asInstanceOf[Long]).asInstanceOf[EdgeProperty]))
java.lang.UnsupportedOperationException: No Encoder found for EdgeProperty
- field (class: "EdgeProperty", name: "attr")
- root class: "org.apache.spark.graphx.Edge"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:598)
...
Any help is greatly appreciated.
Answer 0 (score: 2)

Your question is somewhat moot, because GraphX does not support Datasets: both edges and vertices have to be passed as RDDs. But for the sake of argument:

- You don't need asInstanceOf. An explicit type annotation (ascription) is enough to upcast to the common supertype.
- Datasets are further restricted by their use of Encoders. All objects in a Dataset must use the same Encoder, and in this case the only possibility is a binary encoder, which is not implicitly accessible for user-defined classes.
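For the first point, a minimal sketch on the RDD side, reusing the classes defined in the question (an assumption: the first two columns of docauthor hold Long ids, as the casts above suggest):

import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD

// The ascription ": EdgeProperty" upcasts every attribute to the supertype;
// no asInstanceOf is needed, and an RDD needs no Encoder at all.
val docauthorRdd: RDD[Edge[EdgeProperty]] = docauthor.rdd.map(x =>
  Edge(x.getLong(0), x.getLong(1), authorEdgeProperty(x.getLong(1)): EdgeProperty))

Combining the two pieces on the Dataset side: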
import org.apache.spark.sql.{Dataset, Encoders}

// A sealed trait gives both edge types a single supertype to encode against.
sealed trait EdgeProperty

case class AuthorEdgeProperty(val doccount: Long) extends EdgeProperty
case class CiteEdgeProperty() extends EdgeProperty

// Encoders.kryo supplies the binary encoder explicitly; there is no
// implicit Encoder for a user-defined trait.
val docauthoredges: Dataset[EdgeProperty] = spark.range(10)
  .map(AuthorEdgeProperty(_): EdgeProperty)(Encoders.kryo[EdgeProperty])
val docciteedges: Dataset[EdgeProperty] = spark.range(5)
  .map(_ => CiteEdgeProperty(): EdgeProperty)(Encoders.kryo[EdgeProperty])

// Both Datasets now share one element type and encoder, so union works.
val edges: Dataset[EdgeProperty] = docauthoredges.union(docciteedges)
Convert to an RDD to make it usable in GraphX:
edges.rdd
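From there, a hedged sketch of the final assembly, assuming the vertices VertexRDD from the question and the docauthorRdd built in the earlier sketch (a cite-edge RDD constructed the same way could be unioned in with ++ before building the graph):

import org.apache.spark.graphx.Graph

// Graph(vertexRDD, edgeRDD) pairs every edge with its endpoint attributes.
val graph: Graph[VertexProperty, EdgeProperty] = Graph(vertices, docauthorRdd)

graph.triplets.take(5).foreach(println)  // quick sanity check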