How do I create a DataFrame from the columns and fields in a file?

Date: 2017-05-16 09:09:03

Tags: java apache-spark apache-spark-sql

I have to create a DataFrame from a header and fields. The header and the fields are in a file. The schema is in field5 (col1, col2, ... are my schema columns) and the values come after field6. The file looks like this:

    field1 value1;
    field2 value2;
    field3 value3;
    field4 value4;
    field5 17 col1 col2 col3 col4 col5 col6 col7 col8;
    field6
    val1 val 2 val3 val4 val5 val6 val7 val8
    val9 val10 val11 val12 val13 val14 val15 val16
    val17 val18 val19 val20 val21 val22 val23 val24;
    EndOfFile;

That is the file. I want to extract col1, col2, ..., col8, build a struct (schema) from them, and create a DataFrame whose rows are the values that come after field6.

Should I extract field5 with plain Java code, or can this be done in Spark with Java?

1 answer:

Answer 0: (score: 0)

I'd do the following (but I use Scala, so converting it to Java is left to you as the main exercise):

  1. Load the file as a regular (almost unstructured) text file using spark.read.text
  2. Filter out the irrelevant lines
  3. Create another DataFrame with the requested schema and the remaining rows

Let's see the Scala code:

    val input = spark.read.text("input.txt")
    scala> input.show(false)
    +--------------------------------------------------+
    |value                                             |
    +--------------------------------------------------+
    |field1 value1;                                    |
    |field2 value2;                                    |
    |field3 value3;                                    |
    |field4 value4;                                    |
    |field5 17 col1 col2 col3 col4 col5 col6 col7 col8;|
    |field6                                            |
    |val1 val 2 val3 val4 val5 val6 val7 val8          |
    |val9 val10 val11 val12 val13 val14 val15 val16    |
    |val17 val18 val19 val20 val21 val22 val23 val24;  |
    |EndOfFile;                                        |
    +--------------------------------------------------+
    
    // trying to impress future readers ;-)
    val unnecessaryLines = (2 to 4).
      map(n => 'value startsWith s"field$n").
      foldLeft('value startsWith "field1") { case (f, orfield) => f or orfield }.
      or('value startsWith "field6").
      or('value startsWith "EndOfFile")
    scala> unnecessaryLines.explain(true)
    (((((StartsWith('value, field1) || StartsWith('value, field2)) || StartsWith('value, field3)) || StartsWith('value, field4)) || StartsWith('value, field6)) || StartsWith('value, EndOfFile))
    
    // Filter out the irrelevant lines
    val onlyRelevantLines = input.filter(!unnecessaryLines)
    scala> onlyRelevantLines.show(false)
    +--------------------------------------------------+
    |value                                             |
    +--------------------------------------------------+
    |field5 17 col1 col2 col3 col4 col5 col6 col7 col8;|
    |val1 val 2 val3 val4 val5 val6 val7 val8          |
    |val9 val10 val11 val12 val13 val14 val15 val16    |
    |val17 val18 val19 val20 val21 val22 val23 val24;  |
    +--------------------------------------------------+
    

With that, we have only the relevant lines from the file. Time to have some fun!

    // Take the header line (the first remaining line), strip the "field5 17 " prefix and the trailing `;`
    val field5 = onlyRelevantLines.head.getString(0) // we're leaving Spark here and entering plain Scala
    // the following is pure Scala code (no Spark whatsoever)
    val header = field5.substring("field5 17 ".size).dropRight(1).split("\\s+").toSeq
    
    val rows = onlyRelevantLines.filter(!('value startsWith "field5"))
    scala> :type rows
    org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
    scala> rows.show(false)
    +------------------------------------------------+
    |value                                           |
    +------------------------------------------------+
    |val1 val 2 val3 val4 val5 val6 val7 val8        |
    |val9 val10 val11 val12 val13 val14 val15 val16  |
    |val17 val18 val19 val20 val21 val22 val23 val24;|
    +------------------------------------------------+
    

With that done, you need to split the rows of the Dataset on whitespace. In the (not yet released) Spark 2.2.0 there will be a csv method that loads a Dataset[String] and, given the right separator, gives us exactly what we want:

    def csv(csvDataset: Dataset[String]): DataFrame
    

That's not available yet, so we have to do something similar ourselves.
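For reference, once 2.2.0 is out, using that method from Java (the question's language) would presumably look something like the sketch below. The class name, app name, and the hypothetical values.txt (holding just the whitespace-separated value rows) are mine, not from the answer; sep is the standard CSV separator option:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvOverDatasetSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-over-dataset").getOrCreate();

        // Hypothetical input: values.txt contains only the whitespace-separated value rows
        Dataset<String> valueLines = spark.read().textFile("values.txt");

        // Spark 2.2.0+ only: parse an in-memory Dataset of lines as CSV with a custom separator
        // note: a value containing a space, like "val 2", still ends up in two cells
        Dataset<Row> parsed = spark.read().option("sep", " ").csv(valueLines);
        parsed.show();

        spark.stop();
      }
    }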

In the meantime, let's stick to Spark SQL's Dataset API as much as possible.

    val words = rows.select(split($"value", "\\s+") as "words")
    scala> words.show(false)
    +---------------------------------------------------------+
    |words                                                    |
    +---------------------------------------------------------+
    |[val1, val, 2, val3, val4, val5, val6, val7, val8]       |
    |[val9, val10, val11, val12, val13, val14, val15, val16]  |
    |[val17, val18, val19, val20, val21, val22, val23, val24;]|
    +---------------------------------------------------------+
    
    // The following is just a series of withColumn's for every column in header
    
    val finalDF = header.zipWithIndex.foldLeft(words) { case (df, (hdr, idx)) =>
      df.withColumn(hdr, $"words".getItem(idx)) }.
      drop("words")
    scala> finalDF.show
    +-----+-----+-----+-----+-----+-----+-----+------+
    | col1| col2| col3| col4| col5| col6| col7|  col8|
    +-----+-----+-----+-----+-----+-----+-----+------+
    | val1|  val|    2| val3| val4| val5| val6|  val7|
    | val9|val10|val11|val12|val13|val14|val15| val16|
    |val17|val18|val19|val20|val21|val22|val23|val24;|
    +-----+-----+-----+-----+-----+-----+-----+------+
    

Done!
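And since the question is tagged java, here is a rough, untested sketch of the same steps with the Java API. The class name and app name are my own; everything else mirrors the Scala code above (including leaving the trailing ; on the last row's last value):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.not;
    import static org.apache.spark.sql.functions.split;

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HeaderAndValuesToDataFrame {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("header-and-values").getOrCreate();

        // 1. Load the file as almost unstructured text, one Row per line
        Dataset<Row> input = spark.read().text("input.txt");

        // 2. Filter out everything except the header line (field5 ...) and the value lines
        Column unnecessaryLines = col("value").startsWith("field1")
            .or(col("value").startsWith("field2"))
            .or(col("value").startsWith("field3"))
            .or(col("value").startsWith("field4"))
            .or(col("value").startsWith("field6"))
            .or(col("value").startsWith("EndOfFile"));
        Dataset<Row> onlyRelevantLines = input.filter(not(unnecessaryLines));

        // 3. Parse the header (plain Java from here): drop "field5 17 " and the trailing ';'
        String field5 = onlyRelevantLines.first().getString(0);
        String[] header = field5
            .substring("field5 17 ".length())
            .replaceAll(";$", "")
            .split("\\s+");

        // 4. Keep only the value lines and split them on whitespace
        Dataset<Row> rows = onlyRelevantLines.filter(not(col("value").startsWith("field5")));
        Dataset<Row> words = rows.select(split(col("value"), "\\s+").as("words"));

        // 5. One withColumn per header entry, then drop the intermediate array column
        Dataset<Row> finalDF = words;
        for (int i = 0; i < header.length; i++) {
          finalDF = finalDF.withColumn(header[i], col("words").getItem(i));
        }
        finalDF = finalDF.drop("words");

        finalDF.show();
        spark.stop();
      }
    }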