I need to create a DataFrame with a header and fields. Both come from a file, which looks like the one below. The schema is in field5 (col1, col2, ... are my column names) and the values come after field6.
field1 value1;
field2 value2;
field3 value3;
field4 value4;
field5 17 col1 col2 col3 col4 col5 col6 col7 col8;
field6
val1 val 2 val3 val4 val5 val6 val7 val8
val9 val10 val11 val12 val13 val14 val15 val16
val17 val18 val19 val20 val21 val22 val23 val24;
EndOfFile;
That is the file. I want to extract the values col1, col2, ..., col8, build a struct (schema) from them, and create a DataFrame whose rows are the values that follow field6.
Should I extract field5 with plain Java code, or can this be done in Spark with Java?
Answer 0 (score: 0)
I'd do the following (but I use Scala, so converting it to Java is left to you as the main exercise). The entry point is spark.read.text. Let's walk through the Scala code:
val input = spark.read.text("input.txt")
scala> input.show(false)
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|field1 value1; |
|field2 value2; |
|field3 value3; |
|field4 value4; |
|field5 17 col1 col2 col3 col4 col5 col6 col7 col8;|
|field6 |
|val1 val 2 val3 val4 val5 val6 val7 val8 |
|val9 val10 val11 val12 val13 val14 val15 val16 |
|val17 val18 val19 val20 val21 val22 val23 val24; |
|EndOfFile; |
+--------------------------------------------------+
// trying to impress future readers ;-)
val unnecessaryLines = (2 to 4).
map(n => 'value startsWith s"field$n").
foldLeft('value startsWith "field1") { case (f, orfield) => f or orfield }.
or('value startsWith "field6").
or('value startsWith "EndOfFile")
scala> unnecessaryLines.explain(true)
(((((StartsWith('value, field1) || StartsWith('value, field2)) || StartsWith('value, field3)) || StartsWith('value, field4)) || StartsWith('value, field6)) || StartsWith('value, EndOfFile))
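As a side note, the same predicate could probably be expressed more compactly with a single rlike; this variant is a sketch I haven't run:
// hypothetical one-liner equivalent of unnecessaryLines
val unnecessaryLines2 = 'value rlike "^(field[1-4]|field6|EndOfFile)"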
// Filter out the irrelevant lines
val onlyRelevantLines = input.filter(!unnecessaryLines)
scala> onlyRelevantLines.show(false)
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|field5 17 col1 col2 col3 col4 col5 col6 col7 col8;|
|val1 val 2 val3 val4 val5 val6 val7 val8 |
|val9 val10 val11 val12 val13 val14 val15 val16 |
|val17 val18 val19 val20 val21 val22 val23 val24; |
+--------------------------------------------------+
With that, we're left with only the relevant lines from the file. Time for the fun part!
// Remove field5 from the first line only and `;` at the end
val field5 = onlyRelevantLines.head.getString(0) // we leave Spark here and enter plain Scala
// the following is pure Scala code (no Spark whatsoever)
val header = field5.substring("field5 17 ".size).dropRight(1).split("\\s+").toSeq
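// quick sanity check (my addition, not in the original run): with the sample file
// above, header should now hold the eight column names col1 .. col8
assert(header == Seq("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"))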
val rows = onlyRelevantLines.filter(!('value startsWith "field5"))
scala> :type rows
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
scala> rows.show(false)
+------------------------------------------------+
|value |
+------------------------------------------------+
|val1 val 2 val3 val4 val5 val6 val7 val8 |
|val9 val10 val11 val12 val13 val14 val15 val16 |
|val17 val18 val19 val20 val21 val22 val23 val24;|
+------------------------------------------------+
With that, what's left is to split the Dataset's rows (on whitespace). The not-yet-released Spark 2.2.0 will bring a csv method that loads a Dataset[String]; given the right separator it would give us exactly what we want:
def csv(csvDataset: Dataset[String]): DataFrame
That's not available yet, so we have to do something similar ourselves.
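Just to show where that route is heading, once 2.2.0 ships the call could look roughly like this (a hypothetical, untested sketch against the upcoming API; cleaned and parsed are names I made up):
// strip the trailing `;` and let the upcoming csv(Dataset[String]) do the splitting;
// the columns would come back as _c0, _c1, ... and still need renaming to header
val cleaned = rows.as[String].map(_.stripSuffix(";"))
val parsed = spark.read.option("sep", " ").csv(cleaned)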
In the meantime, let's stick to Spark SQL's Dataset API as much as we can.
// split is org.apache.spark.sql.functions.split (spark-shell imports it for you)
val words = rows.select(split($"value", "\\s+") as "words")
scala> words.show(false)
+---------------------------------------------------------+
|words |
+---------------------------------------------------------+
|[val1, val, 2, val3, val4, val5, val6, val7, val8] |
|[val9, val10, val11, val12, val13, val14, val15, val16] |
|[val17, val18, val19, val20, val21, val22, val23, val24;]|
+---------------------------------------------------------+
// The following is just a series of withColumn's, one for every column in header
val finalDF = header.zipWithIndex.foldLeft(words) { case (df, (hdr, idx)) =>
  df.withColumn(hdr, $"words".getItem(idx))
}.drop("words")
scala> finalDF.show
+-----+-----+-----+-----+-----+-----+-----+------+
| col1| col2| col3| col4| col5| col6| col7| col8|
+-----+-----+-----+-----+-----+-----+-----+------+
| val1| val| 2| val3| val4| val5| val6| val7|
| val9|val10|val11|val12|val13|val14|val15| val16|
|val17|val18|val19|val20|val21|val22|val23|val24;|
+-----+-----+-----+-----+-----+-----+-----+------+
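One leftover worth mentioning: the last column still carries the trailing `;` (note val24; above). If that bothers you, a possible clean-up (my addition, not part of the run above):
// strip a trailing `;` from the last column
val trimmedDF = finalDF.withColumn("col8", regexp_replace($"col8", ";$", ""))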
Done!
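A final aside: if you'd rather avoid the fold of withColumn's, a single select should give the same result (purely a style choice):
// build all columns in one select instead of folding withColumn
val finalDF2 = words.select(
  header.zipWithIndex.map { case (hdr, idx) => $"words".getItem(idx) as hdr }: _*)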