spark dataframe - dynamic substring based on numbers in the string

Date: 2018-04-21 04:26:43

Tags: arrays regex scala apache-spark dataframe

How do I transform the given Spark dataframe (Spark version 2.0, Scala 2.11),

A   B
a   2*Z12*CA9*ThisnThat10*51827630323*fa2
b   1*C7*Friends5*names1*O2
c   4*19456*helpme6*please
d   2*M13*fin2*na2*325*123456*fancy2

into the following format (in Scala or PySpark)?

A   B
a   Z1*CA*ThisnThat*5182763032*fa2
b   C*Friends*names*O
c   1945*helpme*please
d   M1*fin*na*32*12345*fancy2

Logic used: in each row, take a substring of the next value using the first numeric value, then use the remaining numeric part to extract the value after that, and so on...

E.g. for the first string (2*Z12*CA9*ThisnThat10*51827630323*fa2):
* Use the first 2 to break 'Z12' into 'Z1' (two characters) with 2 remaining.  
* Use this 2 to break 'CA9' into 'CA' (two characters) with 9 remaining.  
* Use this 9 to break 'ThisnThat10' into 'ThisnThat' (9 characters) and 10.  
* Use the 10 to break '51827630323' into '5182763032' (10 characters) and 3.  
* Use the 3 to break 'fa2' into 'fa2' (3 characters).  
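The stepwise logic above can be sketched on a plain string, outside Spark. Here is a small hypothetical helper (not from the question) that returns each (kept substring, carried length) pair, assuming the first token is numeric and `*` is the delimiter:

```scala
object StepTrace {
  // Returns (kept substring, length carried to the next step) for each token.
  def steps(s: String): List[(String, Int)] = {
    val tokens = s.split("\\*").toList
    tokens.tail.scanLeft(("", tokens.head.toInt)) { case ((_, n), tok) =>
      val rest = tok.drop(n)                        // digits left over after taking n chars
      (tok.take(n), if (rest.isEmpty) 0 else rest.toInt)
    }.tail                                          // drop the seed element
  }
}
```

For `2*Z12*CA9*ThisnThat10*51827630323*fa2` this yields (Z1,2), (CA,9), (ThisnThat,10), (5182763032,3), (fa2,0), matching the bullet steps above.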

I can split the string and create a wide dataframe with a dynamic number of columns, but I cannot figure out a UDF that shortens the strings.

2 Answers:

Answer 0 (score: 2)

You can create a UDF to process column B as shown below. Try is used to validate the integer conversion, and foldLeft is used to traverse the split substrings and apply the required processing logic.

Note that a tuple (String, Int) is used as the foldLeft accumulator to iteratively build the transformed string and carry over the computed length value (n).

val df = Seq(
  ("a", "2*Z12*CA9*ThisnThat10*51827630323*fa2"),
  ("b", "1*C7*Friends5*names1*O2"),
  ("c", "4*19456*helpme6*please"),
  ("d", "2*M13*fin2*na2*325*123456*fancy2")
).toDF("A", "B")

def processString = udf( (s: String) => {
  import scala.util.{Try, Success, Failure}

  val arr = s.split("\\*")
  // The first token is the initial take-length; default to 0 if non-numeric
  val firstN = Try(arr.head.toInt) match {
    case Success(i) => i
    case Failure(_) => 0
  }

  // Accumulator: (string built so far, take-length for the next token)
  arr.tail.foldLeft( ("", firstN) ){ (acc, x) =>
    // Whatever remains after taking acc._2 characters becomes the next length
    val n = Try( x.drop(acc._2).toInt ) match {
      case Success(i) => i
      case Failure(_) => 0
    }
    ( acc._1 + "*" + x.take(acc._2), n )
  }._1.tail  // drop the leading "*"
} )

df.select($"A", processString($"B").as("B")).
  show(false)
// +---+------------------------------+
// |A  |B                             |
// +---+------------------------------+
// |a  |Z1*CA*ThisnThat*5182763032*fa2|
// |b  |C*Friends*names*O             |
// |c  |1945*helpme*please            |
// |d  |M1*fin*na*32*12345*fancy2     |
// +---+------------------------------+
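The foldLeft core does not depend on Spark, so the same accumulator logic can be extracted into a plain function and unit-tested without a SparkSession. A sketch (`FoldShorten.shorten` is a hypothetical name, not part of the answer):

```scala
import scala.util.Try

object FoldShorten {
  // Accumulator shape mirrors the UDF: (string built so far, next take-length).
  def shorten(s: String): String = {
    val arr = s.split("\\*")
    val firstN = Try(arr.head.toInt).getOrElse(0)
    arr.tail.foldLeft(("", firstN)) { case ((built, n), x) =>
      val next = Try(x.drop(n).toInt).getOrElse(0)  // leftover digits drive the next take
      (built + "*" + x.take(n), next)
    }._1.drop(1)  // drop the leading "*"
  }
}
```

For example, `FoldShorten.shorten("4*19456*helpme6*please")` returns `"1945*helpme*please"`, the same as row c in the output above.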

Answer 1 (score: 1)

Assuming you have the following dataframe (data taken from the question):

+---+-------------------------------------+
|A  |B                                    |
+---+-------------------------------------+
|a  |2*Z12*CA9*ThisnThat10*51827630323*fa2|
|b  |1*C7*Friends5*names1*O2              |
|c  |4*19456*helpme6*please               |
|d  |2*M13*fin2*na2*325*123456*fancy2     |
+---+-------------------------------------+

you then need a recursive function inside the udf function, as follows:

import org.apache.spark.sql.functions._
import scala.collection.mutable.ListBuffer

def shorteningUdf = udf((actualStr: String) => {
  val arrayStr = actualStr.split("\\*")
  val nextSubStrIndex = arrayStr.head.toInt   // first token gives the initial split index
  val listBuffer = new ListBuffer[String]
  // Recursively split each token at the carried index; the remainder becomes the next index
  def recursiveFund(arrayStr2: List[String], index: Int, resultStrBuff: ListBuffer[String]): ListBuffer[String] = arrayStr2 match{
    case head :: Nil => resultStrBuff += head.splitAt(index)._1   // last token: keep the prefix only
    case head :: tail => {
      val splitStr = head.splitAt(index)
      recursiveFund(tail, splitStr._2.toInt, resultStrBuff += splitStr._1)
    }
    case _ => resultStrBuff
  }
  recursiveFund(arrayStr.tail.toList, nextSubStrIndex, listBuffer).mkString("*")
})

So when you call the udf function:

df.withColumn("B", shorteningUdf(col("B"))).show(false)

you will get the desired output:

+---+------------------------------+
|A  |B                             |
+---+------------------------------+
|a  |Z1*CA*ThisnThat*5182763032*fa2|
|b  |C*Friends*names*O             |
|c  |1945*helpme*please            |
|d  |M1*fin*na*32*12345*fancy2     |
+---+------------------------------+
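The mutable ListBuffer can also be avoided with an immutable, tail-recursive variant of the same idea. This is a sketch under the same assumptions as the answer (the leftover digits after each split must still parse as an Int); `RecShorten.shorten` is a hypothetical name:

```scala
import scala.annotation.tailrec

object RecShorten {
  def shorten(s: String): String = {
    // Prepend to an immutable List and reverse at the end instead of mutating a buffer.
    @tailrec
    def loop(tokens: List[String], index: Int, acc: List[String]): List[String] =
      tokens match {
        case head :: Nil  => (head.splitAt(index)._1 :: acc).reverse  // last token: keep the prefix
        case head :: tail =>
          val (kept, rest) = head.splitAt(index)
          loop(tail, rest.toInt, kept :: acc)
        case Nil          => acc.reverse
      }
    val arr = s.split("\\*")
    loop(arr.tail.toList, arr.head.toInt, Nil).mkString("*")
  }
}
```

Because `loop` is annotated with `@tailrec`, the compiler guarantees it runs in constant stack space, which matters if a row ever contains many tokens.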

I hope the answer is helpful.