I am reading data created inside the program itself (using sc.parallelize), and I am able to read it and apply further transformations and actions on that dataset without any problem.
val d = sc.parallelize(Seq(
  11 -> Seq(21, 51, 61, 111, 112),
  21 -> Seq(51, 111, 112, 115, 116),
  31 -> Seq(61, 111, 112, 117, 121),
  41 -> Seq(31, 111, 112, 117, 122)))
/* d: RDD[(Int, Seq[Int])] */
val thes = 2
val r = d
  .flatMapValues(x => x)                           // (id, sid) pairs
  .map(_.swap)                                     // (sid, id)
  .groupByKey                                      // sid -> all ids containing that sid
  .map(_._2)
  .flatMap(x => expand(x.toSeq))                   // every id pair within a group
  .map(_ -> 1)
  .reduceByKey(_ + _)                              // count shared sids per id pair
  .filter(_._2 >= thes)                            // keep pairs with count >= thes
  .map(_._1)
  .flatMap(x => Seq(x._1 -> x._2, x._2 -> x._1))   // add both directions
  .groupByKey.mapValues(_.toArray)                 // id -> related ids
r.toDF().show() /* gives the expected output */
I am now trying to read the same dataset from a file. I can read the data, but when I apply the same transformations and actions as above, the conversion from String to Int fails.
/*inputfile
id,sid
11,"21,51,61,111,112"
21,"51,111,112,115,116"
31,"61,111,112,117,121"
41,"31,111,112,117,122"
*/
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").load(inputFile)
.rdd.map(x=> x.getAs[Int]("id") -> (x.getAs[String]("sid").split(",")
.toList.toSeq).map(_.toInt))
val thes = 2
/* df: RDD[(Int, Seq[Int])] */
val r = df
  .flatMapValues(x => x)
  .map(_.swap)
  .groupByKey
  .map(_._2)
  .flatMap(x => expand(x.toSeq))
  .map(_ -> 1)
  .reduceByKey(_ + _)
  .filter(_._2 >= thes)
  .map(_._1)
  .flatMap(x => Seq(x._1 -> x._2, x._2 -> x._1))
  .groupByKey.mapValues(_.toArray)
r.toDF().show() /* this is where the exception below is thrown */
The String-to-Int conversion fails during the flatMap operation itself (java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer). I am not sure where the String comes from, since df is reported to be of type RDD[(Int, Seq[Int])].
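For reference, here is a minimal sketch of an explicit parse (my assumption: spark-csv loads every column as a String when inferSchema is not enabled, and getAs[Int]("id") is erased at runtime, so the bad cast only surfaces once the value is actually used as an Int; dfParsed is just an illustrative name):

val dfParsed = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load(inputFile)
  .rdd
  .map(x => x.getAs[String]("id").toInt ->                       // read id as the String it is, then convert
    x.getAs[String]("sid").split(",").map(_.trim.toInt).toSeq)   // parse each sid the same way
/* dfParsed: RDD[(Int, Seq[Int])] holding actual Int values at runtime */

Alternatively, .option("inferSchema", "true") should let spark-csv infer id as a numeric column, while sid would still need the manual split and toInt.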
def expand(seq: Seq[Int]): Seq[(Int, Int)] =
  if (seq.isEmpty) Seq[(Int, Int)]()
  else seq.tail.map(x => seq.head -> x) ++ expand(seq.tail)
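For completeness, expand pairs the head of the sequence with every element of its tail and then recurses on the tail, e.g.:

expand(Seq(1, 2, 3)) // Seq((1,2), (1,3), (2,3))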