I am not able to convert an RDD produced with zipWithIndex back into a DataFrame.

I read from a file, and I need to skip the first 3 records and then limit the records to row 10. For that I used rdd.zipWithIndex.

But after that, when I try to save the 7 resulting records, I am not able to do so.
val skipValue = 3
val limitValue = 10
val delimValue = ","

val df = spark.read.format("com.databricks.spark.csv")
  .option("delimiter", delimValue)
  .option("header", "false")
  .load("/user/ashwin/data1/datafile.txt")

val df1 = df.rdd.zipWithIndex()
  .filter(x => { x._2 > skipValue && x._2 <= limitValue })
  .map(f => Row(f._1))

df1.foreach(f2 => print(f2.toString))
[[113,3Bapi,Ghosh,86589579]][[114,4Bapi,Ghosh,86589579]]
[[115,5Bapi,Ghosh,86589579]][[116,6Bapi,Ghosh,86589579]]
[[117,7Bapi,Ghosh,86589579]][[118,8Bapi,Ghosh,86589579]]
[[119,9Bapi,Ghosh,86589579]]
scala> val df = spark.read.format("com.databricks.spark.csv").option("delimiter", delimValue).option("header", "true").load("/user/bigframe/ashwin/data1/datafile.txt")
df: org.apache.spark.sql.DataFrame = [empid: string, fname: string ... 2 more fields]
scala> val df1 = df.rdd.zipWithIndex().filter(x => { x._2 > skipValue && x._2 <= limitValue;}).map(f => Row(f._1))
df1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[885] at map at <console>:38
scala> import spark.implicits._
import spark.implicits._
scala> df1.
++ count flatMap groupBy mapPartitionsWithIndex reduce takeAsync union
aggregate countApprox fold id max repartition takeOrdered unpersist
cache countApproxDistinct foreach intersection min sample takeSample zip
cartesian countAsync foreachAsync isCheckpointed name saveAsObjectFile toDebugString zipPartitions
checkpoint countByValue foreachPartition isEmpty partitioner saveAsTextFile toJavaRDD zipWithIndex
coalesce countByValueApprox foreachPartitionAsync iterator partitions setName toLocalIterator zipWithUniqueId
collect dependencies getCheckpointFile keyBy persist sortBy toString
collectAsync distinct getNumPartitions localCheckpoint pipe sparkContext top
compute filter getStorageLevel map preferredLocations subtract treeAggregate
context first glom mapPartitions randomSplit take treeReduce
scala> df1.toDF
<console>:44: error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
df1.toDF
^
Answer 0 (score: 2)
Once you convert the dataframe to an RDD[Row] you are working with an rdd, so to turn it back into a dataframe you need sqlContext.createDataFrame(). You also need a schema to create the dataframe, and in this case you can reuse the schema of the earlier df.
val df1 = df.rdd.zipWithIndex()
  .filter(x => { x._2 > 3 && x._2 <= 10 })
  .map(_._1)  // keep only the original Row, drop the index

// reuse the schema of the original dataframe to build the new one
val result = spark.sqlContext.createDataFrame(df1, df.schema)
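Since the original goal was to save the 7 surviving records, a minimal follow-up sketch building on result above (the output path and format here are illustrative, not from the question):

// result is a regular DataFrame again, so the normal writer API applies
result.show(false)                    // should list the 7 remaining rows
result.write
  .option("delimiter", ",")
  .csv("/user/ashwin/data1/output")   // hypothetical output directory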
Hope this helps!
Answer 1 (score: 0)
It is probably of type RDD[Row] now. Have you tried using the toDF function? You will also have to import spark.implicits._.
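Note that, as the error in the question shows, toDF from spark.implicits._ is only available for RDDs whose element type has an Encoder (tuples or case classes), not for RDD[Row]. A minimal sketch under that assumption, pulling the four string columns out of each Row first (the last two column names are guesses based on the sample data):

import spark.implicits._

val df2 = df.rdd.zipWithIndex()
  .filter { case (_, idx) => idx > 3 && idx <= 10 }  // skip the leading rows, stop at 10
  .map { case (row, _) =>
    // the CSV was loaded with all columns as strings
    (row.getString(0), row.getString(1), row.getString(2), row.getString(3))
  }
  .toDF("empid", "fname", "lname", "phone")  // first two names from the schema above, last two assumed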