Getting data from an Array[String] in an RDD using Scala

Date: 2017-03-27 21:54:46

Tags: arrays scala apache-spark rdd

I have an Array[String] of the form:

res3: Array[String] =
Array("{{Infobox officeholder
|name=Abraham Lincoln
|image=Abraham Lincoln November 1863.jpg{{!}}border
|term_start=March 4, 1861
|term_end=April 15, 1865
|term_start2=March 4, 1847,
"{{Infobox officeholder
|name=Mickael Jackson
|term_start=April 9, 1991
|term_end=April 15, 1865
|term_start2=March 4, 1847")

Now I need to create an array of the form:

("Abraham Lincoln: March 4, 1861",
"Michael Jackson: April 9, 1991",
...

However, term_start is not always at the same position within each entry, so I need some way to match each line with a regular expression or a contains check. Is there a way to do this in Scala? The data is loaded from a bz2 file and then transformed into this shape. Thanks a lot.
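
Something along these lines is the kind of thing I am after (a rough sketch only; rdd here stands for my RDD of infobox strings, and the regular expressions are just my guess at the field layout shown in the sample above):

// Rough sketch (assumptions: one infobox block per element, fields look
// like "|name=..." and "|term_start=..." as in the sample above)
val nameRe  = """\|name=([^\n|]+)""".r
val startRe = """\|term_start=([^\n|]+)""".r

val result = rdd.map { block =>
  val name  = nameRe.findFirstMatchIn(block).map(_.group(1).trim).getOrElse("")
  val start = startRe.findFirstMatchIn(block).map(_.group(1).trim).getOrElse("")
  s"$name: $start"
}
result.collect()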

1 Answer:

Answer 0 (score: 0)

I don't quite understand your output format, but this example using DataFrames may help you solve the problem:

import org.apache.spark.sql.functions.{col, explode, udf}

case class Message(text: String)

// Split each message into its "key=value" fragments
val iterations: (String => Array[String]) = (input: String) => {
  input.split('|')
}
val udf_iterations = udf(iterations)

// Turn "key=value" into "value: key"
val transformation: (String => String) = (input: String) => {
  input.split("=")(1).trim + ": " + input.split("=")(0).trim
}
val udf_transformation = udf(transformation)

val p1 = Message("AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1")
val p2 = Message("ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2")

val records = Seq(p1, p2)
val df = spark.createDataFrame(records)

// Explode the fragments into one row each, then reformat every fragment
df.withColumn("text-explode", explode(udf_iterations(col("text"))))
  .withColumn("text-transformed", udf_transformation(col("text-explode")))
  .show(false)

+---------------------------------------+-------------+----------------+
|text                                   |text-explode |text-transformed|
+---------------------------------------+-------------+----------------+
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1|AAA=valAAA1  |valAAA1: AAA    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| BBB=valBBB1 |valBBB1: BBB    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| CCC=valCCC1 |valCCC1: CCC    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2|ZZZ=valZZZ2  |valZZZ2: ZZZ    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| AAA=valAAA2 |valAAA2: AAA    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| BBB=valBBB2 |valBBB2: BBB    |
+---------------------------------------+-------------+----------------+
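
If your rows actually contain infobox blocks like the ones in the question, the same split-and-transform idea can be adapted. Below is a rough sketch only, under the assumption that a DataFrame infoboxDf has a text column holding one whole infobox block per row and that only name and term_start are wanted:

// Assumption: infoboxDf("text") holds one infobox block per row,
// with fields written as "|key=value" on separate lines
val extract: (String => String) = (block: String) => {
  val fields = block.split('\n')
    .filter(_.contains("="))
    .map { line =>
      val Array(key, value) = line.stripPrefix("|").split("=", 2)
      key.trim -> value.trim
    }.toMap
  fields.getOrElse("name", "") + ": " + fields.getOrElse("term_start", "")
}
val udf_extract = udf(extract)

infoboxDf.withColumn("summary", udf_extract(col("text"))).show(false)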