Scala: convert and split a string column into a MapType column in a DataFrame

Asked: 2020-09-14 04:21:16

Tags: scala dataframe apache-spark

I have a DataFrame with a column containing tracking-request URLs with fields embedded in them, which looks like this:

df.show(truncate = false)
+-----------------------------------
| request_uri
+-----------------------------------
| /i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349 ...
| /i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00 ...
| ...

I need to convert this column into something like this:

df.show(truncate = false)
+--------------------------------
| request_uri
+--------------------------------
| (aid -> fptplay, ast -> 1582163970763, tz -> [timezone datatype], nt -> wifi , ...) 
| (p -> fplay-ottbox-2019, av -> 2.0.18, ov -> 9, tv -> 1.0.0 , ...) 
| ...

Basically, I have to split the field names (delimiter = "&") and their values into some kind of MapType, and add that as a column.

Could someone give me pointers on how to write a custom function that splits the string column into a MapType column? I was told to use withColumn() and mapPartitions, but I couldn't figure out how to implement it in a way that splits the string and converts it to a MapType.

Even the smallest help would be sincerely appreciated. I'm new to Scala and have been stuck on this for a week.

2 Answers:

Answer 0 (score: 0)

The solution is to use a UserDefinedFunction.

Let's take this one step at a time.

// We need a function which converts strings into maps
// based on the format of request uris
def requestUriToMap(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map(elem => {
    val pair = elem.split("=")     
    (pair(0), pair(1)) // evaluate each element to a tuple
  }).toMap
}
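Before wiring this into Spark, the parsing logic can be sanity-checked in plain Scala; a minimal self-contained sketch (the function is redefined here so the snippet runs on its own, without a Spark session):

```scala
// Standalone check of the parsing logic above (no Spark needed).
def requestUriToMap(s: String): Map[String, String] =
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    (pair(0), pair(1)) // evaluate each element to a tuple
  }.toMap

val parsed = requestUriToMap("/i?aid=fptplay&av=4.6.1")
println(parsed) // Map(aid -> fptplay, av -> 4.6.1)
```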

// Now we convert this function into a UserDefinedFunction.
import org.apache.spark.sql.functions.{col, udf}
// Given a request uri string, convert it to a map; the correct format is assumed.
val requestUriToMapUdf = udf((requestUri: String) => requestUriToMap(requestUri))

Now, let's test it.

// Test data
import spark.implicits._ // required for toDF (assumes a SparkSession named spark is in scope)
val df = Seq(
  ("/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349"),
  ("/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00")
).toDF("request_uri")

df.show(false)
//+-----------------------------------------------------------------------+
//|request_uri                                                            |
//+-----------------------------------------------------------------------+
//|/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349         |
//|/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00|
//+-----------------------------------------------------------------------+

// Now we execute our UDF to create a column, using the same name replaces that column
val mappedDf = df.withColumn("request_uri", requestUriToMapUdf(col("request_uri")))
mappedDf.show(false)
//+---------------------------------------------------------------------------------------------+
//|request_uri                                                                                  |
//+---------------------------------------------------------------------------------------------+
//|[aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349]                 |
//|[av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi]|
//+---------------------------------------------------------------------------------------------+

mappedDf.printSchema
//root
// |-- request_uri: map (nullable = true)
// |    |-- key: string
// |    |-- value: string (valueContainsNull = true)

mappedDf.schema
//org.apache.spark.sql.types.StructType = StructType(StructField(request_uri,MapType(StringType,StringType,true),true))

And that is what you wanted.
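One follow-up the question hints at (the `tz -> [timezone datatype]` row): the extracted values are still percent-encoded, e.g. `GMT%2B07%3A00`. A small sketch using the JDK's `URLDecoder`, which could be applied to each value inside the same UDF if decoded values are wanted:

```scala
// The map values coming out of the UDF are still percent-encoded.
// java.net.URLDecoder turns e.g. "GMT%2B07%3A00" into "GMT+07:00".
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

def decodeValue(v: String): String =
  URLDecoder.decode(v, StandardCharsets.UTF_8.name)

println(decodeValue("GMT%2B07%3A00")) // GMT+07:00
```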


Alternative: if you are not sure the string conforms to the expected format, you can try the following variant of the function, which succeeds even when the string does not match the assumed format (for instance, when it contains no "=" or the input is an empty string).

def requestUriToMapImproved(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map(elem => {
    val pair = elem.split("=")
    pair.length match {
      case 0 => ("", "") // in case the given string produces an array with no elements e.g. "=".split("=") == Array() 
      case 1 => (pair(0), "") // in case the given string contains no = and produces a single element e.g. "potato".split("=") == Array("potato")
      case _ => (pair(0), pair(1)) // normal case e.g. "potato=masher".split("=") == Array("potato", "masher")
    }
  }).toMap
}
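The defensive variant can also be exercised on the edge cases its comments describe; a self-contained check (the function is repeated so the snippet runs without the rest of the answer):

```scala
// Edge-case checks for the defensive variant (plain Scala, no Spark).
def requestUriToMapImproved(s: String): Map[String, String] =
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    pair.length match {
      case 0 => ("", "")            // e.g. "=".split("=") == Array()
      case 1 => (pair(0), "")       // e.g. "potato".split("=") == Array("potato")
      case _ => (pair(0), pair(1))  // normal case
    }
  }.toMap

// A fragment without "=" maps to an empty value instead of throwing.
println(requestUriToMapImproved("/i?aid=fptplay&flag"))
```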

Answer 1 (score: 0)

The following code performs a two-stage split. Since the uris have no fixed structure, you can achieve this with the UDF below:

val keys = List("aid", "p", "ast", "av", "did", "nt", "ov", "tv", "tz")

import scala.collection.mutable // WrappedArray lives here

def convertToMap(keys: List[String]) = udf {
  (in: mutable.WrappedArray[String]) =>
    in.foldLeft[Map[String, String]](Map()){ (a, str) =>
      keys.flatMap { key =>
        val regex = s"""${key}="""
        val arr = str.split(regex)
        val value = {
          if(arr.length == 2) arr(1)
            else ""
        }

        if(!value.isEmpty)
          a + (key -> value)
        else
          a
      }.toMap
    }
}

import org.apache.spark.sql.functions.split
import spark.implicits._ // for the $"col" syntax (assumes a SparkSession named spark)

df.withColumn("_tmp",
  split($"request_uri", """((&)|(\?))"""))
  .withColumn("map_result", convertToMap(keys)($"_tmp"))
  .select($"map_result")
  .show(false)

This yields a MapType column:

+------------------------------------------------------------------------------------------------+
|map_result                                                                                      |
+------------------------------------------------------------------------------------------------+
|Map(aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349)                 |
|Map(av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi)|
+------------------------------------------------------------------------------------------------+
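The two-stage logic can also be checked in plain Scala before running it on the cluster; this sketch mirrors the split-then-fold above on one sample uri (the key list and uri are taken from the examples in this thread):

```scala
// Stage 1: split the uri on '&' or '?'.
// Stage 2: for each fragment, try each known key and keep any "key=value" match.
val keys = List("aid", "p", "ast", "av", "did", "nt", "ov", "tv", "tz")
val uri  = "/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi"

val parts = uri.split("""((&)|(\?))""")
val result = parts.foldLeft(Map.empty[String, String]) { (acc, fragment) =>
  keys.foldLeft(acc) { (m, key) =>
    val arr = fragment.split(s"$key=")
    if (arr.length == 2) m + (key -> arr(1)) else m
  }
}
println(result) // Map(p -> fplay-ottbox-2019, av -> 2.0.18, nt -> wifi)
```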