I tried to create a class with more than 100 attributes and parse it into a DataFrame, but I got this error:
too many arguments for unapply pattern, maximum = 22
As far as I understand, this comes from Scala's 22-element limit on tuples and case-class unapply pattern matching. So I tried this solution, but it fills every column with randomly generated values, which is not my case; this is what my RDD looks like:
N1 N2 N3 N4 N5 Nn
32055680 16/09/2010 16:59:59:245 16/09/2016 17:00:00:000 xxxxxxxxxxxxx
32055680 16/09/2010 16:59:59:245 16/09/2016 17:00:00:000 xxxxxxxxxxxxx
32055680 16/09/2010 16:59:59:245 16/09/2016 17:00:00:000 xxxxxxxxxxxxx
32055680 16/09/2010 16:59:59:245 16/09/2016 17:00:00:000 xxxxxxxxxxxxx
32055680 16/09/2010 16:59:59:245 16/09/2016 17:00:00:000 xxxxxxxxxxxxx
I want to convert it into a Spark SQL DataFrame with a schema like this:
| N1       | N2         | N3           | N4         | N5           | Nn    |
|----------|------------|--------------|------------|--------------|-------|
| 32055680 | 16/09/2010 | 16:59:59:245 | 16/09/2016 | 17:00:00:000 | xxxxx |
| 32055680 | 16/09/2010 | 16:59:59:245 | 16/09/2016 | 17:00:00:000 | xxxxx |
| 32055680 | 16/09/2010 | 16:59:59:245 | 16/09/2016 | 17:00:00:000 | xxxxx |
| 20556800 | 16/09/2010 | 16:59:59:245 | 16/09/2016 | 17:00:00:000 | xxxxx |
| 32055680 | 16/09/2010 | 16:59:59:245 | 16/09/2016 | 17:00:00:000 | xxxxx |
This is my RDD:
val file = spContext.textFile("C:/***/files/ze.xl3")
// zipWithIndex indices start at 0, so "> 5" keeps line 7 onwards (drops the first six lines)
val file2 = file.zipWithIndex().filter(_._2 > 5).map(_._1)
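A quick way to eyeball what actually got loaded and how many fields each line has (not part of the original post; it assumes the fields are tab-separated, as in the answer below):
file2.take(1).foreach(line => println(s"${line.split("\t", -1).length} fields: $line"))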
I am using this example:
import scala.util.Random
val numCols = 100
val numRows = 5
val delimiter = "\t"
val sqlContext = new org.apache.spark.sql.SQLContext(spContext)
import org.apache.spark.sql._
import org.apache.spark.sql.functions.udf // brings the udf helper into scope outside the shell
import sqlContext.implicits._
def generateRowData = (0 until numCols).map(i => Random.alphanumeric.take(5).mkString).mkString(delimiter)
val df = spContext.parallelize((0 until numRows).map(i => generateRowData).toList).toDF("data")
def extractCol(i: Int, sep: String) = udf[String, String](_.split(sep)(i))
val result = (0 until numCols).foldLeft(df){case (acc,i) => acc.withColumn(s"c$i", extractCol(i,delimiter)($"data"))}.drop($"data")
result.printSchema
result.show
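The idea of the example is to keep each input line as a single raw string column and peel the fields off one index at a time with a small UDF inside a foldLeft, so no 22-plus-field case class is ever needed. A quick peek at a few of the generated columns (a sketch, assuming the result DataFrame built above):
result.select("c0", "c1", "c99").show(5)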
My question is: how can I fill the columns with my own RDD data instead of the generated values? Thanks.
Answer 0 (score: 0)
val numCols = 104
val delimiter = "\t"
val sqlContext = new org.apache.spark.sql.SQLContext(spContext)
import org.apache.spark.sql._
import org.apache.spark.sql.functions.udf // needed for udf(...) below
import sqlContext.implicits._
val file = spContext.textFile("C:/***/files/ze.xl3")
val file2 = file.zipWithIndex().filter(_._2 > 5).map(_._1)
val df = file2.toDF("raw")
// UDF that extracts the i-th field from a delimiter-separated line
def extractCol(i: Int, sep: String) = udf[String, String](_.split(sep)(i))
// Add one column per field index, then drop the original raw column
val result = (0 until numCols).foldLeft(df){case (acc,i) => acc.withColumn(s"c$i", extractCol(i,delimiter)($"raw"))}.drop($"raw")
result.printSchema
result.show
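The foldLeft/withColumn approach works, but it re-splits the raw line once per column (104 UDF calls per row). An alternative sketch, not part of the original answer, splits each line once and builds Rows against an explicit StructType, which also sidesteps the 22-field case-class limit; the names rowRdd, schema and result2 are mine, and the column names c0..c103 and the tab delimiter are assumed from the code above:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Split each raw line once, padding/truncating to exactly numCols fields
val rowRdd = file2.map { line =>
  Row.fromSeq(line.split(delimiter, -1).padTo(numCols, "").take(numCols))
}

// One nullable string column per field; rename/retype to match the real layout if needed
val schema = StructType((0 until numCols).map(i => StructField(s"c$i", StringType, nullable = true)))

val result2 = sqlContext.createDataFrame(rowRdd, schema)
result2.printSchema
result2.show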