Converting an RDD to a DataFrame with .toDF() when the CSV data was read using SparkContext (not sqlContext)

Time: 2017-08-19 22:25:57

Tags: scala dataframe apache-spark-sql spark-dataframe

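I am new to SparkSQL. Please help me. My specific question is this: we can convert the RDD hospitalDataText to a DataFrame (using .toDF()), where hospitalDataText was read from a CSV file using the Spark context (not using sqlContext.read.csv("path")). So why can't we write header.toDF()? If I try to convert the variable header RDD to a DataFrame, it throws the error: value toDF is not a member of String. Why? My main purpose is to view the data of the variable header RDD using the .show() function, so why am I not able to convert the RDD to a DataFrame? Please check the code given below! It looks like a DOUBLE-STANDARD :'(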

scala> val hospitalDataText = sc.textFile("/Users/TheBhaskarDas/Desktop/services.csv")
hospitalDataText: org.apache.spark.rdd.RDD[String] = /Users/TheBhaskarDas/Desktop/services.csv MapPartitionsRDD[39] at textFile at <console>:33

scala> val header = hospitalDataText.first() //Remove the header
header: String = uhid,locationid,doctorid,billdate,servicename,servicequantity,starttime,endtime,servicetype,servicecategory,deptname
  

scala> header.toDF()
<console>:38: error: value toDF is not a member of String
       header.toDF()
              ^
scala> val hospitalData = hospitalDataText.filter(a => a != header)
hospitalData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[40] at filter at <console>:37

scala> val m = hospitalData.toDF()
m: org.apache.spark.sql.DataFrame = [value: string]

scala> println(m)
[value: string]

scala> m.show()
+--------------------+
|               value|
+--------------------+
|32d84f8b9c5193838...|
|32d84f8b9c5193838...|
|213d66cb9aae532ff...|
|222f8f1766ed4e7c6...|
|222f8f1766ed4e7c6...|
|993f608405800f97d...|
|993f608405800f97d...|
|fa14c3845a8f1f6b0...|
|6e2899a575a534a1d...|
|6e2899a575a534a1d...|
|1f1603e3c0a0db5e6...|
|508a4fbea4752771f...|
|5f33395ae7422c3cf...|
|5f33395ae7422c3cf...|
|4ef07783ce800fc5d...|
|70c13902c9c9ccd02...|
|70c13902c9c9ccd02...|
|a950feff6911ab5e4...|
|b1a0d427adfdc4f7e...|
|b1a0d427adfdc4f7e...|
+--------------------+
only showing top 20 rows


scala> m.show(1)
+--------------------+
|               value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row


scala> m.show(1,true)
+--------------------+
|               value|
+--------------------+
|32d84f8b9c5193838...|
+--------------------+
only showing top 1 row


scala> m.show(1,2)
+-----+
|value|
+-----+
|   32|
+-----+
only showing top 1 row

1 Answer:

Answer 0 (score: 3)

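You keep saying that header is an RDD, while the output you posted clearly shows that header is a String. first() does not return an RDD; it returns the first element of the RDD, which here is a single line of text. You cannot call show() on a String, but you can print it, for example with println.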
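If the goal is still to view the header through .show(), one possible approach is to wrap the String in a collection first, since toDF() is provided (via spark.implicits._) for Seq and RDD values, not for a bare String. The following is a minimal sketch, assuming Spark 2.x or later with SparkSession available. The object name HeaderToDF, the helper columnNames, the column labels "header" and "column", and the small in-memory sample standing in for services.csv are all hypothetical and not part of the original question.

```scala
import org.apache.spark.sql.SparkSession

object HeaderToDF {
  // Hypothetical helper: split a CSV header line into its column names.
  def columnNames(header: String): Seq[String] = header.split(",").toSeq

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` and `sc` already exist.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("HeaderToDF")
      .getOrCreate()
    import spark.implicits._ // brings toDF() into scope for Seq and RDD

    val sc = spark.sparkContext

    // Assumption: a tiny in-memory stand-in for sc.textFile(".../services.csv").
    val hospitalDataText = sc.parallelize(Seq(
      "uhid,locationid,doctorid",
      "32d8,1,7",
      "213d,2,9"
    ))

    // first() returns one element of the RDD, so this is a String, not an RDD.
    val header: String = hospitalDataText.first()

    // Simplest way to look at it: print the String.
    println(header)

    // Wrap the String in a Seq to get a one-row, one-column DataFrame.
    Seq(header).toDF("header").show(false)

    // Or show one row per column name.
    columnNames(header).toDF("column").show(false)

    spark.stop()
  }
}
```

In spark-shell, only the lines after the import of the implicits would be needed, because the session already exists. If the real aim is a DataFrame with one column per CSV field, reading the file with spark.read.option("header", "true").csv(path) may be a more direct route than going through sc.textFile.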