How to split a column by length and maxsplit in a PySpark DataFrame?

Asked: 2020-07-01 11:38:55

Tags: pyspark apache-spark-sql pyspark-dataframes

For example,

if I read a CSV and display it in PySpark, I am given the following column:

+--------+
|   Names|
+--------+
|Rahul   |
|Ravi    |
|Raghu   |
|Romeo   |
+--------+

If I specify in the function, as such,

Length = 2, Maxsplit = 3

then I should get the result:

+----------+-----------+----------+
|Col_1     |Col_2      |Col_3     |
+----------+-----------+----------+
|      Ra  |      hu   |    l     |
|      Ra  |      vi   |    Null  |
|      Ra  |      gh   |    u     |
|      Ro  |      me   |    o     |
+----------+-----------+----------+

Similarly, in PySpark,

Length = 3, Maxsplit = 2 should give me output such as:

+----------+-----------+
|Col_1     |Col_2      |
+----------+-----------+
|      Rah |      ul   |
|      Rav |      i    |
|      Rag |      hu   |
|      Rom |      eo   |
+----------+-----------+

That is how it should look. Thanks.
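For reference, a sample DataFrame like the one shown above could be built as follows (a minimal sketch; the column name Names and the use of spark.createDataFrame are assumptions matching the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the Names column shown above
df = spark.createDataFrame([("Rahul",), ("Ravi",), ("Raghu",), ("Romeo",)], ["Names"])
df.show()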

4 Answers:

Answer 0 (score: 4)

Another way to solve this. It should be faster than any loop or udf solution.

from pyspark.sql import functions as F

def split(df, length, maxsplit):
    # Split Names after every `length` characters using the Java regex \G anchor
    # (a zero-width match at the end of the previous match), then take the first
    # `maxsplit` array elements as separate columns.
    return df.withColumn('Names', F.split("Names", "(?<=\\G{})".format('.' * length)))\
             .select(*((F.col("Names")[x]).alias("Col_" + str(x + 1)) for x in range(0, maxsplit)))

split(df, 3, 2).show()

#+-----+-----+
#|Col_1|Col_2|
#+-----+-----+
#|  Rah|   ul|
#|  Rav|    i|
#|  Rag|   hu|
#|  Rom|   eo|
#+-----+-----+

split(df,2,3).show()

#+-----+-----+-----+
#|Col_1|Col_2|Col_3|
#+-----+-----+-----+
#|   Ra|   hu|    l|
#|   Ra|   vi|     |
#|   Ra|   gh|    u|
#|   Ro|   me|    o|
#+-----+-----+-----+
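Note that when the name length is an exact multiple of the split length (Ravi with length 2 above), the trailing column comes back as an empty string rather than Null. If true nulls are needed, as in the expected output, one possible tweak is a when/otherwise on each element (a sketch, assuming the same regex-split approach):

from pyspark.sql import functions as F

def split_with_nulls(df, length, maxsplit):
    parts = F.split("Names", "(?<=\\G{})".format('.' * length))
    # Replace empty-string chunks with null before aliasing the columns
    return df.select(*(
        F.when(parts[x] == "", F.lit(None)).otherwise(parts[x]).alias("Col_" + str(x + 1))
        for x in range(maxsplit)
    ))

split_with_nulls(df, 2, 3).show()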

Answer 1 (score: 1)

Try this,

import pyspark.sql.functions as F

tst = sqlContext.createDataFrame([("Raghu", 1), ("Ravi", 2), ("Rahul", 3)], schema=["Name", "val"])

# Add one column per chunk (coln0, coln1, ...), each holding `split` characters of Name
def fn(split, max_n, tst):
    for i in range(max_n):
        tst = tst.withColumn("coln" + str(i), F.substring(F.col("Name"), (i * split) + 1, split))
    return tst

tst_res = fn(3, 2, tst)

The for loop can also be replaced with a list comprehension or with reduce, but I feel the for loop looks neater in your case. Either way they produce the same physical plan (a reduce version is sketched after the result below).

Result

+-----+---+-----+-----+
| Name|val|coln0|coln1|
+-----+---+-----+-----+
|Raghu|  1|  Rag|   hu|
| Ravi|  2|  Rav|    i|
|Rahul|  3|  Rah|   ul|
+-----+---+-----+-----+
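As mentioned above, the loop can be swapped for functools.reduce without changing the physical plan; a minimal sketch of that variant (same substring logic, hypothetical name fn_reduce):

from functools import reduce
import pyspark.sql.functions as F

def fn_reduce(split, max_n, tst):
    # Fold over the chunk indices, adding one substring column per step
    return reduce(
        lambda df, i: df.withColumn("coln" + str(i), F.substring(F.col("Name"), (i * split) + 1, split)),
        range(max_n),
        tst,
    )

tst_res = fn_reduce(3, 2, tst)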

Answer 2 (score: 1)

Try this


Answer 3 (score: 1)

Perhaps this is useful -

Load the test data

Note: written in Scala

    val Length = 2
    val Maxsplit = 3
    val df = Seq("Rahul", "Ravi", "Raghu", "Romeo").toDF("Names")
    df.show(false)
    /**
      * +-----+
      * |Names|
      * +-----+
      * |Rahul|
      * |Ravi |
      * |Raghu|
      * |Romeo|
      * +-----+
      */

Split the string column based on the length and maxSplit


    import org.apache.spark.sql.RowFactory
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    // One nullable string field per target column; the UDF chunks the string and pads with nulls
    val schema = StructType(Range(1, Maxsplit + 1).map(f => StructField(s"Col_$f", StringType)))
    val split = udf((str: String, length: Int, maxSplit: Int) => {
      val splits = str.toCharArray.grouped(length).map(_.mkString).toArray
      RowFactory.create(splits ++ Array.fill(maxSplit - splits.length)(null): _*)
    }, schema)

    val p = df
     .withColumn("x", split($"Names", lit(Length), lit(Maxsplit)))
     .selectExpr("x.*")

    p.show(false)
    p.printSchema()

    /**
      * +-----+-----+-----+
      * |Col_1|Col_2|Col_3|
      * +-----+-----+-----+
      * |Ra   |hu   |l    |
      * |Ra   |vi   |null |
      * |Ra   |gh   |u    |
      * |Ro   |me   |o    |
      * +-----+-----+-----+
      *
      * root
      * |-- Col_1: string (nullable = true)
      * |-- Col_2: string (nullable = true)
      * |-- Col_3: string (nullable = true)
      */

Dataset[Row] -> Dataset[Array[String]]

    val x = df.map(r => {
      val splits = r.getString(0).toCharArray.grouped(Length).map(_.mkString).toArray
      splits ++ Array.fill(Maxsplit - splits.length)(null)
    })
    x.show(false)
    x.printSchema()

    /**
      * +-----------+
      * |value      |
      * +-----------+
      * |[Ra, hu, l]|
      * |[Ra, vi,]  |
      * |[Ra, gh, u]|
      * |[Ro, me, o]|
      * +-----------+
      *
      * root
      * |-- value: array (nullable = true)
      * |    |-- element: string (containsNull = true)
      */