I have an Array[String] of the following form:
res3: Array[String] =
Array("{{Infobox officeholder
|name=Abraham Lincoln
|image=Abraham Lincoln November 1863.jpg{{!}}border
|term_start=March 4, 1861
|term_end=April 15, 1865
|term_start2=March 4, 1847",
"{{Infobox officeholder
|name=Michael Jackson
|term_start=April 9, 1991
|term_end=April 15, 1865
|term_start2=March 4, 1847")
Now I need to create an array of the form:
("Abraham Lincoln: March 4, 1861",
"Michael Jackson: April 9, 1991",
...
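However, term_start is not always at the same index within an entry, so I need some way to find it with a regular expression or a per-line contains check. Is there a way to do this in Scala? The data is loaded from a bz2 file and then converted into this shape. Thanks a lot.
Answer 0 (score: 0)
I don't really understand your output format, but this DataFrame-based example may help you solve the problem: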
import org.apache.spark.sql.functions.{col, explode, udf}

// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
case class Message(text: String)

// Split a record into its "key=value" fields.
val iterations: (String => Array[String]) = (input: String) => {
  input.split('|')
}
val udf_iterations = udf(iterations)

// Turn "key=value" into "value: key".
val transformation: (String => String) = (input: String) => {
  val parts = input.split("=")
  parts(1).trim + ": " + parts(0).trim
}
val udf_transformation = udf(transformation)

val p1 = Message("AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1")
val p2 = Message("ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2")
val records = Seq(p1, p2)
val df = spark.createDataFrame(records)

// One output row per field, plus a column with the transformed text.
df.withColumn("text-explode", explode(udf_iterations(col("text"))))
  .withColumn("text-transformed", udf_transformation(col("text-explode")))
  .show(false)
+---------------------------------------+-------------+----------------+
|text                                   |text-explode |text-transformed|
+---------------------------------------+-------------+----------------+
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1|AAA=valAAA1  |valAAA1: AAA    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| BBB=valBBB1 |valBBB1: BBB    |
|AAA=valAAA1 | BBB=valBBB1 | CCC=valCCC1| CCC=valCCC1 |valCCC1: CCC    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2|ZZZ=valZZZ2  |valZZZ2: ZZZ    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| AAA=valAAA2 |valAAA2: AAA    |
|ZZZ=valZZZ2 | AAA=valAAA2 | BBB=valBBB2| BBB=valBBB2 |valBBB2: BBB    |
+---------------------------------------+-------------+----------------+
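If the goal is just the "name: term_start" strings from the question, a regular expression per entry may be enough without going through a DataFrame. The following is only a sketch: the `infoboxes` sample is a shortened, hypothetical copy of the question's data, and `field` is a helper name made up for illustration. It assumes each field sits on its own line in the form `|key=value`.

```scala
import java.util.regex.Pattern

// Hypothetical sample shaped like the question's data; note that the
// fields appear in a different order in the second entry.
val infoboxes: Array[String] = Array(
  "{{Infobox officeholder\n|name=Abraham Lincoln\n|image=Abraham Lincoln November 1863.jpg{{!}}border\n|term_start=March 4, 1861\n|term_end=April 15, 1865\n|term_start2=March 4, 1847",
  "{{Infobox officeholder\n|term_start=April 9, 1991\n|name=Michael Jackson"
)

// Look up one field by key, wherever its line sits inside the entry.
// (?m) makes ^ and $ match at line boundaries, so "term_start" does not
// accidentally match the "term_start2" line.
def field(entry: String, key: String): Option[String] = {
  val pattern =
    ("""(?m)^\|[ \t]*""" + Pattern.quote(key) + """[ \t]*=[ \t]*(.*?)[ \t]*$""").r
  pattern.findFirstMatchIn(entry).map(_.group(1))
}

// Keep only the entries that have both fields.
val result: Array[String] = infoboxes.flatMap { entry =>
  for {
    name  <- field(entry, "name")
    start <- field(entry, "term_start")
  } yield s"$name: $start"
}

result.foreach(println)
```

Entries missing either field are dropped rather than raising an error. If the data lives in an RDD[String] or Dataset[String], the same function body should work inside `flatMap` there, though that is untested here.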