Convert the given Spark DataFrame (Spark version 2.0, Scala 2.11):
A B
a 2*Z12*CA9*ThisnThat10*51827630323*fa2
b 1*C7*Friends5*names1*O2
c 4*19456*helpme6*please
d 2*M13*fin2*na2*325*123456*fancy2
into the following format (in Scala or PySpark):
A B
a Z1*CA*ThisnThat*5182763032*fa2
b C*Friends*names*O
c 1945*helpme*please
d M1*fin*na*32*12345*fancy2
Logic to use - in each row, use the first numeric value to take a substring of the next value, then use the leftover numeric part to extract the value after that, and so on (a plain-Scala sketch of this logic follows the list below).
E.g. for the first string
(2*Z12*CA9*ThisnThat10*51827630323*fa2):
* Use the first 2 to break 'Z12' into 'Z1' (two characters) with 2 remaining.
* Use this 2 to break 'CA9' into 'CA' (two characters) with 9 remaining.
* Use this 9 to break 'ThisnThat10' into 'ThisnThat' (9 characters) and 10.
* Use the 10 to break '51827630323' into '5182763032' (10 characters) and 3.
* Use the 3 to break 'fa2' into 'fa2' (3 characters).
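In plain (non-Spark) Scala, the chaining logic can be sketched like this (chainSubstrings is just an illustrative name, not working code I already have):

// Illustrative sketch of the chained-substring logic, outside Spark.
def chainSubstrings(s: String): String = {
  val parts = s.split("\\*")
  var n = parts.head.toInt                  // the first numeric value seeds the chain
  parts.tail.map { p =>
    val (keep, rest) = p.splitAt(n)         // keep n characters
    n = if (rest.isEmpty) 0 else rest.toInt // the remainder drives the next split
    keep
  }.mkString("*")
}

chainSubstrings("2*Z12*CA9*ThisnThat10*51827630323*fa2")
// Z1*CA*ThisnThat*5182763032*fa2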
I can split the string and create a wide DataFrame with a dynamic number of columns - but I cannot figure out the UDF that shortens the strings.
Answer 0 (score: 2)
You can create a UDF to process column B, as shown below. Try is used to validate the integer conversion, and foldLeft is used to traverse the split substrings and apply the required processing logic. Note that a tuple of (String, Int) is used as the foldLeft accumulator, so the string is transformed iteratively while the computed length value (n) is carried along:
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark (automatic in spark-shell)

val df = Seq(
  ("a", "2*Z12*CA9*ThisnThat10*51827630323*fa2"),
  ("b", "1*C7*Friends5*names1*O2"),
  ("c", "4*19456*helpme6*please"),
  ("d", "2*M13*fin2*na2*325*123456*fancy2")
).toDF("A", "B")
def processString = udf( (s: String) => {
  import scala.util.{Try, Success, Failure}
  val arr = s.split("\\*")
  // The leading element seeds the chain: how many characters to keep from the next element.
  val firstN = Try(arr.head.toInt) match {
    case Success(i) => i
    case Failure(_) => 0
  }
  // Accumulator: (result built so far, number of characters to keep from the current element).
  arr.tail.foldLeft( ("", firstN) ){ (acc, x) =>
    // Whatever trails the kept characters is the length for the next element (0 if absent).
    val n = Try( x.drop(acc._2).toInt ) match {
      case Success(i) => i
      case Failure(_) => 0
    }
    ( acc._1 + "*" + x.take(acc._2), n )
  }._1.tail  // drop the leading "*"
} )
df.select($"A", processString($"B").as("B")).
show(false)
// +---+------------------------------+
// |A |B |
// +---+------------------------------+
// |a |Z1*CA*ThisnThat*5182763032*fa2|
// |b |C*Friends*names*O |
// |c |1945*helpme*please |
// |d |M1*fin*na*32*12345*fancy2 |
// +---+------------------------------+
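Note that processString assumes column B is never null; a null value would reach s.split and throw a NullPointerException on the executor. A minimal null-safe variation, as a sketch (the Option wrapper is an addition, not part of the logic above):

import scala.util.Try

// Hypothetical null-safe variant: returns null for null input instead of failing.
def processStringSafe = udf( (s: String) => Option(s).map { str =>
  val arr = str.split("\\*")
  val firstN = Try(arr.head.toInt).getOrElse(0)
  arr.tail.foldLeft( ("", firstN) ){ case ((res, n), x) =>
    (res + "*" + x.take(n), Try(x.drop(n).toInt).getOrElse(0))
  }._1.tail
}.orNull )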
Answer 1 (score: 1)
Assuming you have the following dataframe (data taken from the question):
+---+-------------------------------------+
|A |B |
+---+-------------------------------------+
|a |2*Z12*CA9*ThisnThat10*51827630323*fa2|
|b |1*C7*Friends5*names1*O2 |
|c |4*19456*helpme6*please |
|d |2*M13*fin2*na2*325*123456*fancy2 |
+---+-------------------------------------+
Then you need a recursive function inside the udf function, as follows:
import org.apache.spark.sql.functions._
import scala.collection.mutable.ListBuffer
import scala.annotation.tailrec

def shorteningUdf = udf((actualStr: String) => {
  val arrayStr = actualStr.split("\\*")
  // The first element gives the number of characters to keep from the next substring.
  val nextSubStrIndex = arrayStr.head.toInt
  val listBuffer = new ListBuffer[String]

  // Split each substring at the inherited index; the remainder of each split
  // becomes the index for the next substring.
  @tailrec
  def recursiveFunc(arrayStr2: List[String], index: Int, resultStrBuff: ListBuffer[String]): ListBuffer[String] = arrayStr2 match {
    case head :: Nil => resultStrBuff += head.splitAt(index)._1
    case head :: tail =>
      val splitStr = head.splitAt(index)
      recursiveFunc(tail, splitStr._2.toInt, resultStrBuff += splitStr._1)
    case _ => resultStrBuff
  }

  recursiveFunc(arrayStr.tail.toList, nextSubStrIndex, listBuffer).mkString("*")
})
So when you call the udf function
df.withColumn("B", shorteningUdf(col("B"))).show(false)
you will get the desired output:
+---+------------------------------+
|A |B |
+---+------------------------------+
|a |Z1*CA*ThisnThat*5182763032*fa2|
|b |C*Friends*names*O |
|c |1945*helpme*please |
|d |M1*fin*na*32*12345*fancy2 |
+---+------------------------------+
I hope the answer is helpful.