Error enriching DataFrame columns

Time: 2017-04-05 14:14:50

Tags: apache-spark apache-spark-sql spark-dataframe rdd

I tried to create a class with more than 100 attributes and parse my data into a DataFrame with it, but I got this error:

    too many arguments for unapply pattern, maximum = 22
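
For reference, the 22-argument limit comes from Scala case-class pattern matching; declaring the schema explicitly with StructType avoids it. A minimal sketch, assuming an RDD[String] of tab-delimited lines named `lines`, an existing `sqlContext`, and placeholder column names c0…c103:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val nCols = 104                 // assumed number of fields per line
    val sep = "\t"
    // one StringType field per column: c0, c1, ..., c103
    val schema = StructType((0 until nCols).map(i => StructField(s"c$i", StringType, nullable = true)))
    // split each line, pad short lines with empty strings, and wrap the fields in a Row
    val rowRdd = lines.map(_.split(sep, -1).padTo(nCols, "").take(nCols)).map(fields => Row.fromSeq(fields))
    val dfFromSchema = sqlContext.createDataFrame(rowRdd, schema)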

So I tried this solution, but it fills every column with random values rather than using my data; this is what my RDD looks like:

        N1          N2          N3              N4          N5              Nn
        32055680    16/09/2010  16:59:59:245    16/09/2016  17:00:00:000    xxxxxxxxxxxxx
        32055680    16/09/2010  16:59:59:245    16/09/2016  17:00:00:000    xxxxxxxxxxxxx
        32055680    16/09/2010  16:59:59:245    16/09/2016  17:00:00:000    xxxxxxxxxxxxx
        32055680    16/09/2010  16:59:59:245    16/09/2016  17:00:00:000    xxxxxxxxxxxxx
        32055680    16/09/2010  16:59:59:245    16/09/2016  17:00:00:000    xxxxxxxxxxxxx

I want to turn it into a Spark SQL DataFrame with a schema like this:

    |      N1    |         N2       |     N3         |      N4      |          N5     |    Nn |
    | ----------------------------------------------------------------------------------------|
    |   32055680 |   16/09/2010     |   16:59:59:245 |  16/09/2016  |   17:00:00:000  | xxxxx |
    |   32055680 |   16/09/2010     |   16:59:59:245 |  16/09/2016  |   17:00:00:000  | xxxxx |
    |   32055680 |   16/09/2010     |   16:59:59:245 |  16/09/2016  |   17:00:00:000  | xxxxx |
    |   20556800 |   16/09/2010     |   16:59:59:245 |  16/09/2016  |   17:00:00:000  | xxxxx |
    |   32055680 |   16/09/2010     |   16:59:59:245 |  16/09/2016  |   17:00:00:000  | xxxxx | 
    ------------------------------------------------------------------------------------------- 

This is my RDD:

    // read the raw file and keep only the lines after the first 6 (indices 0-5 are skipped)
    val file = spContext.textFile("C:/***/files/ze.xl3")
    val file2 = file.zipWithIndex().filter(_._2 > 5).map(_._1)

I am using this example:

    import scala.util.Random
    val numCols = 100
    val numRows = 5 
    val delimiter = "\t"
    val sqlContext = new org.apache.spark.sql.SQLContext(spContext)
    import org.apache.spark.sql._
    import sqlContext.implicits._
    // build one tab-delimited row of random 5-character strings
    def generateRowData = (0 until numCols).map(i => Random.alphanumeric.take(5).mkString).mkString(delimiter)
    val df = spContext.parallelize((0 until numRows).map(i => generateRowData).toList).toDF("data")
    // UDF that returns the i-th field of a delimited string
    def extractCol(i: Int, sep: String) = udf[String, String](_.split(sep)(i))
    // add one column per field, then drop the original raw string
    val result = (0 until numCols).foldLeft(df){ case (acc, i) => acc.withColumn(s"c$i", extractCol(i, delimiter)($"data")) }.drop($"data")
    result.printSchema
    result.show

My question is: how can I enrich the columns with my own RDD data instead? Thanks.

1 Answer:

Answer 0 (score: 0):

    val numCols = 104
    val delimiter = "\t"
    val sqlContext = new org.apache.spark.sql.SQLContext(spContext)
    import org.apache.spark.sql._
    import sqlContext.implicits._

    // read the file and keep only the lines after the first 6
    val file = spContext.textFile("C:/***/files/ze.xl3")
    val file2 = file.zipWithIndex().filter(_._2 > 5).map(_._1)
    val df = file2.toDF("raw")

    // UDF that returns the i-th field of a delimited line
    def extractCol(i: Int, sep: String) = udf[String, String](_.split(sep)(i))
    // add one column per field, then drop the raw line
    val result = (0 until numCols).foldLeft(df){ case (acc, i) => acc.withColumn(s"c$i", extractCol(i, delimiter)($"raw")) }.drop($"raw")
    result.printSchema
    result.show
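
A note on the approach: the extractCol UDF re-splits each raw line once per column (104 times per row here). A minimal sketch of a cheaper variant, under the same assumptions (the df above with its single "raw" string column and a tab delimiter), splits once with the built-in split function and picks the fields by index:

    import org.apache.spark.sql.functions.split

    // split "raw" once into an array column, then select one item per field
    val parts = split($"raw", delimiter)
    val cols = (0 until numCols).map(i => parts.getItem(i).as(s"c$i"))
    val result2 = df.select(cols: _*)
    result2.printSchema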