Reading a CSV file with Scala

Time: 2017-12-15 06:02:16

Tags: scala apache-spark dataframe

I am building a dataset inside the program (using sc.parallelize). I can read it and apply further transformations and actions to it.

val d = sc.parallelize(Seq(11 -> Seq(21,51,61,111,112), 
                            21 -> Seq(51,111,112,115,116), 
                            31-> Seq(61,111,112,117,121), 
                            41-> Seq(31,111,112,117,122)))

/* d of type val d: RDD[(Int, Seq[Int])]*/

val thes = 2
val r = d
  .flatMapValues(x => x)
  .map(_.swap)
  .groupByKey
  .map(_._2)
  .flatMap(x => expand(x.toSeq))
  .map(_ -> 1)
  .reduceByKey(_ + _)
  .filter(_._2 >= thes)
  .map(_._1)
  .flatMap(x => Seq(x._1 -> x._2, x._2 -> x._1))
  .groupByKey.mapValues(_.toArray)
r.toDF().show() /* gives the expected output */

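I am now trying to read the same dataset from a file. The read itself works, but when I apply the same transformations and actions as above, the conversion from String to Int fails.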

/*inputfile
 id,sid
 11,"21,51,61,111,112"
 21,"51,111,112,115,116"
 31,"61,111,112,117,121"
 41,"31,111,112,117,122"
 */

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load(inputFile)
  .rdd.map(x => x.getAs[Int]("id") -> (x.getAs[String]("sid").split(",")
    .toList.toSeq).map(_.toInt))

val thes = 2
/* df of type val df: RDD[(Int, Seq[Int])] */

val r = df
  .flatMapValues(x => x)
  .map(_.swap)
  .groupByKey
  .map(_._2)
  .flatMap(x => expand(x.toSeq))
  .map(_ -> 1)
  .reduceByKey(_ + _)
  .filter(_._2 >= thes)
  .map(_._1)
  .flatMap(x => Seq(x._1 -> x._2, x._2 -> x._1))
  .groupByKey.mapValues(_.toArray)

r.toDF().show()

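The flatMap step itself fails because a String cannot be converted to an Int (java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer). df is reported as type RDD[(Int, Seq[Int])], so I am not sure where the String comes into the picture.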
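One plausible explanation, sketched here in plain Scala without Spark: when no schema is inferred, the CSV reader returns every column as a String, and Row.getAs[T] appears to behave like an unchecked cast rather than a conversion. Because of type erasure, getAs[Int] on a String value does not fail immediately. The boxed String travels through the pipeline typed as Int and only fails when something needs a real Int, which here would be the pair construction inside expand. The getAs helper below is a hypothetical stand-in for Row.getAs, not Spark's code.

```scala
import scala.util.Try

object UncheckedCastDemo {
  // Hypothetical stand-in for Row.getAs[T]; assumed to be an unchecked cast.
  def getAs[T](value: Any): T = value.asInstanceOf[T]

  // What a CSV reader without schema inference would hand back: a String.
  val rawId: Any = "11"

  // No exception here: the tuple holds a boxed String that is only typed as Int.
  val brokenPair: (Int, Seq[Int]) = getAs[Int](rawId) -> Seq(21, 51)

  // The failure surfaces later, when the value is actually used as an Int.
  val delayedFailure: Try[Int] = Try(brokenPair._1 + 1)

  // Reading the value as a String and parsing it avoids the problem.
  val fixedPair: (Int, Seq[Int]) = getAs[String](rawId).toInt -> Seq(21, 51)

  def main(args: Array[String]): Unit = {
    println(delayedFailure.isFailure) // true: a ClassCastException was captured
    println(fixedPair._1 + 1)         // 12
  }
}
```

If this is indeed the cause, replacing x.getAs[Int]("id") with x.getAs[String]("id").toInt in the file-based version should be enough. Another option is to add .option("inferSchema", "true") to the reader so that the id column is loaded as an integer.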

def expand(seq: Seq[Int]): Seq[(Int, Int)] =
  if (seq.isEmpty)
    Seq[(Int, Int)]()
  else
    seq.tail.map(x => seq.head -> x) ++ expand(seq.tail)
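To make the role of expand concrete, here is a small standalone sketch of what it returns: every pair of elements, each pair once, in input order. In the pipeline above it is applied to the ids grouped under a shared sid value, so each resulting pair is two ids that have a value in common. The pairs helper is a hypothetical alternative built on the standard combinations method. It is assumed to match expand only when the input has no duplicate elements.

```scala
object ExpandDemo {
  // Same logic as the question's expand, re-indented.
  def expand(seq: Seq[Int]): Seq[(Int, Int)] =
    if (seq.isEmpty) Seq.empty
    else seq.tail.map(x => seq.head -> x) ++ expand(seq.tail)

  // Hypothetical alternative; assumes the input has no duplicate elements.
  def pairs(seq: Seq[Int]): List[(Int, Int)] =
    seq.combinations(2).map(c => (c(0), c(1))).toList

  def main(args: Array[String]): Unit = {
    println(expand(Seq(1, 2, 3))) // the pairs (1,2), (1,3), (2,3)
    println(pairs(Seq(1, 2, 3)))  // the same three pairs
  }
}
```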

0 answers:

No answers yet