Splitting a large DataFrame into multiple smaller DataFrames

Time: 2019-07-26 18:26:19

Tags: scala dataframe apache-spark filter

I have a large dataframe with 100+ columns, where sets of columns share the same base name followed by a unique number. Multiple smaller dataframes need to be created based on this unique number.

Yes, the column names follow the same pattern, and the number of such groups can sometimes be 64 and sometimes 128: net1, net2, net3 ... net64 ... net128.

I need to end up with 64 or 128 sub-dataframes. I can't use startswith, because column names like net1, net10, net11 ... net100, net101 ... would all match the same prefix.

I have already created a solution in Spark + Scala and it works, but I feel there must be a simpler way to do this dynamically:

df_net.printSchema()

|-- net1: string (nullable = true)
|-- net1_a: integer (nullable = true)
|-- net1_b: integer (nullable = true)
|-- net1_c: integer (nullable = true)
|-- net1_d: integer (nullable = true)
|-- net1_e: integer (nullable = true)
|-- net2: string (nullable = true)
|-- net2_a: integer (nullable = true)
|-- net2_b: integer (nullable = true)
|-- net2_c: integer (nullable = true)
|-- net2_d: integer (nullable = true)
|-- net2_e: integer (nullable = true)
|-- net3: string (nullable = true)
|-- net3_a: integer (nullable = true)
|-- net3_b: integer (nullable = true)
|-- net3_c: integer (nullable = true)
|-- net3_d: integer (nullable = true)
|-- net3_e: integer (nullable = true)
|-- net4: string (nullable = true)
|-- net4_a: integer (nullable = true)
|-- net4_b: integer (nullable = true)
|-- net4_c: integer (nullable = true)
|-- net4_d: integer (nullable = true)
|-- net4_e: integer (nullable = true)
|-- net5: string (nullable = true)
|-- net5_a: integer (nullable = true)
|-- net5_b: integer (nullable = true)
|-- net5_c: integer (nullable = true)
|-- net5_d: integer (nullable = true)
|-- net5_e: integer (nullable = true)
val df_net1 = df_net
  .filter(!($"net1".isNull))
  .select("net1", "net1_a", "net1_b", "net1_c", "net1_d", "net1_e")

val df_net2 = df_net
  .filter(!($"net2".isNull))
  .select("net2", "net2_a", "net2_b", "net2_c", "net2_d", "net2_e")

val df_net3 = df_net
  .filter(!($"net3".isNull))
  .select("net3", "net3_a", "net3_b", "net3_c", "net3_d", "net3_e")

The smaller dataframes are filtered based on the unique number.

4 Answers:

Answer 0 (score: 1)

Assuming your DF splits predictably into groups of 6 columns, the following will produce an Iterator of DataFrames, where each element contains 6 columns from the parent dataset:

scala> df.printSchema
root
 |-- net1: string (nullable = false)
 |-- net1_a: integer (nullable = false)
 |-- net1_b: integer (nullable = false)
 |-- net1_c: integer (nullable = false)
 |-- net1_d: integer (nullable = false)
 |-- net1_e: integer (nullable = false)
 |-- net2: string (nullable = false)
 |-- net2_a: integer (nullable = false)
 |-- net2_b: integer (nullable = false)
 |-- net2_c: integer (nullable = false)
 |-- net2_d: integer (nullable = false)
 |-- net2_e: integer (nullable = false)
 |-- net3: string (nullable = false)
 |-- net3_a: integer (nullable = false)
 |-- net3_b: integer (nullable = false)
 |-- net3_c: integer (nullable = false)
 |-- net3_d: integer (nullable = false)
 |-- net3_e: integer (nullable = false)
 |-- net4: string (nullable = false)
 |-- net4_a: integer (nullable = false)
 |-- net4_b: integer (nullable = false)
 |-- net4_c: integer (nullable = false)
 |-- net4_d: integer (nullable = false)
 |-- net4_e: integer (nullable = false)
 |-- net5: string (nullable = false)
 |-- net5_a: integer (nullable = false)
 |-- net5_b: integer (nullable = false)
 |-- net5_c: integer (nullable = false)
 |-- net5_d: integer (nullable = false)
 |-- net5_e: integer (nullable = false)

scala> val sub_dfs = df.schema.map(_.name).grouped(6).map{fields => df.select(fields.map(col): _*).where(col(fields.head).isNotNull)}
sub_dfs: Iterator[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = non-empty iterator

scala> sub_dfs.foreach{_.printSchema}
root
 |-- net1: string (nullable = false)
 |-- net1_a: integer (nullable = false)
 |-- net1_b: integer (nullable = false)
 |-- net1_c: integer (nullable = false)
 |-- net1_d: integer (nullable = false)
 |-- net1_e: integer (nullable = false)
root
 |-- net2: string (nullable = false)
 |-- net2_a: integer (nullable = false)
 |-- net2_b: integer (nullable = false)
 |-- net2_c: integer (nullable = false)
 |-- net2_d: integer (nullable = false)
 |-- net2_e: integer (nullable = false)
root
 |-- net3: string (nullable = false)
 |-- net3_a: integer (nullable = false)
 |-- net3_b: integer (nullable = false)
 |-- net3_c: integer (nullable = false)
 |-- net3_d: integer (nullable = false)
 |-- net3_e: integer (nullable = false)
root
 |-- net4: string (nullable = false)
 |-- net4_a: integer (nullable = false)
 |-- net4_b: integer (nullable = false)
 |-- net4_c: integer (nullable = false)
 |-- net4_d: integer (nullable = false)
 |-- net4_e: integer (nullable = false)
root
 |-- net5: string (nullable = false)
 |-- net5_a: integer (nullable = false)
 |-- net5_b: integer (nullable = false)
 |-- net5_c: integer (nullable = false)
 |-- net5_d: integer (nullable = false)
 |-- net5_e: integer (nullable = false)
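Since grouped returns an Iterator that can only be traversed once, it can be convenient to materialize the sub-DataFrames into a Map keyed by the leading column of each group. A minimal sketch along those lines, assuming the same df and 6-column grouping as above (the subDfByGroup name is just illustrative):

import org.apache.spark.sql.functions.col

// Key each 6-column sub-DataFrame by its first column name ("net1", "net2", ...)
val subDfByGroup = df.schema.map(_.name).grouped(6).map { fields =>
  fields.head -> df.select(fields.map(col): _*).where(col(fields.head).isNotNull)
}.toMap

// Look up an individual group by name
subDfByGroup("net3").show()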


Answer 1 (score: 1)

The columns in the dataframe seem to follow a pattern, since they start with a common prefix string. If that does not change, you can use something like the following.

val df_net1 = df.select(df.columns.filter(a => a.startsWith("net1")).map(a => df(a)) : _*)

val df_net2 = df.select(df.columns.filter(a => a.startsWith("net2")).map(a => df(a)) : _*)

val df_net3 = df.select(df.columns.filter(a => a.startsWith("net3")).map(a => df(a)) : _*)
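Note that, as the question points out, a plain startsWith check conflates groups once there are more than nine of them (e.g. "net1" also matches net10, net100 and their sub-columns). Matching on the exact group boundary avoids this; below is a minimal sketch under the netN / netN_x naming shown in the schema (the groupColumns helper is just illustrative, not part of the original answer):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Columns belonging to group n: exactly "net<n>", or "net<n>_" followed by a suffix
def groupColumns(df: DataFrame, n: Int): Array[String] =
  df.columns.filter(c => c == s"net$n" || c.startsWith(s"net${n}_"))

val df_net1  = df.select(groupColumns(df, 1).map(col): _*)   // net1, net1_a, ... only
val df_net10 = df.select(groupColumns(df, 10).map(col): _*)  // would not pick up net1's columns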

Answer 2 (score: 1)

Assuming you have common prefix names across the columns, this solution will work for a variable number of columns sharing the same prefix.

package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}

object FilterDataframes extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  import spark.implicits._

  val df = spark
    .sparkContext.parallelize(Seq(new MyNets())).toDF
  df.show


  case class MyNets(
                     net1: Int = 1,
                     net1_a: Int = 2,
                     net1_b: Int = 3,
                     net1_c: Int = 4,
                     net1_d: Int = 4,
                     net1_e: Int = 5,
                     net2: Int = 6,
                     net2_a: Int = 7,
                     net2_b: Int = 8,
                     net2_c: Int = 9,
                     net2_d: Int = 10,
                     net2_e: Int = 11,
                     net3: Int = 12,
                     net3_a: Int = 13,
                     net3_b: Int = 14,
                     net3_c: Int = 15,
                     net3_d: Int = 16,
                     net4_e: Int = 17,
                     net5: Int = 18,
                     net5_a: Int = 19,
                     net5_b: Int = 20,
                     net5_c: Int = 21,
                     net5_d: Int = 22,
                     net5_e: Int = 23
                   )
  val net1: DataFrame = df.select(df.columns.filter(_.startsWith("net1")).map(df(_)): _*)
  val net2: DataFrame = df.select(df.columns.filter(_.startsWith("net2")).map(df(_)): _*)
  val net3: DataFrame = df.select(df.columns.filter(_.startsWith("net3")).map(df(_)): _*)
  val net4: DataFrame = df.select(df.columns.filter(_.startsWith("net4")).map(df(_)): _*)
  val net5: DataFrame = df.select(df.columns.filter(_.startsWith("net5")).map(df(_)): _*)

  net1.show
  net2.show
  net3.show
  net4.show
  net5.show
}

Result:

+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+
|net1|net1_a|net1_b|net1_c|net1_d|net1_e|net2|net2_a|net2_b|net2_c|net2_d|net2_e|net3|net3_a|net3_b|net3_c|net3_d|net4_e|net5|net5_a|net5_b|net5_c|net5_d|net5_e|
+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+
|   1|     2|     3|     4|     4|     5|   6|     7|     8|     9|    10|    11|  12|    13|    14|    15|    16|    17|  18|    19|    20|    21|    22|    23|
+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+

+----+------+------+------+------+------+
|net1|net1_a|net1_b|net1_c|net1_d|net1_e|
+----+------+------+------+------+------+
|   1|     2|     3|     4|     4|     5|
+----+------+------+------+------+------+

+----+------+------+------+------+------+
|net2|net2_a|net2_b|net2_c|net2_d|net2_e|
+----+------+------+------+------+------+
|   6|     7|     8|     9|    10|    11|
+----+------+------+------+------+------+

+----+------+------+------+------+
|net3|net3_a|net3_b|net3_c|net3_d|
+----+------+------+------+------+
|  12|    13|    14|    15|    16|
+----+------+------+------+------+

+------+
|net4_e|
+------+
|    17|
+------+

+----+------+------+------+------+------+
|net5|net5_a|net5_b|net5_c|net5_d|net5_e|
+----+------+------+------+------+------+
|  18|    19|    20|    21|    22|    23|
+----+------+------+------+------+------+

Now you can apply null checks on the resulting dataframes.
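For example, a null check on the resulting dataframes could look like the sketch below (reusing the net1 and net2 dataframes from the snippet above; with the all-Int sample data there are no nulls to drop, but with the nullable string/Integer columns from the question the filter does real work):

// Keep only rows where the leading column of the group is populated
val net1NonNull = net1.filter(net1("net1").isNotNull)
val net2NonNull = net2.filter(net2("net2").isNotNull)

net1NonNull.show()
net2NonNull.show()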

Answer 3 (score: 0)

I would collapse the different groups of net fields into a single set, distinguished by a net_type field. Then you can do a partitioned write, which lets you easily load a single set, or as many sets as you need.

This gives you a number of benefits:

  • If you ever need to aggregate, e.g. counts by type, it's easy
  • You can load one set, or any number of them.
  • Spark will automatically work out which partitions to load for you, based on the net_type value you filter on
  • All output files are written by Spark in a single pass, rather than one write per group

Here is the code that does this:

import org.apache.spark.sql.functions._

case class Net(net1:Integer, 
               net1_a:Integer,
               net1_b:Integer,
               net2:Integer,
               net2_a:Integer,
               net2_b:Integer)

val df = Seq(
    Net(1, 1, 1, null, null, null),
    Net(2, 2, 2, null, null, null),
    Net(null, null, null, 3, 3, 3)
).toDS

// You could find these automatically if you wanted
val columns = Seq("net1", "net2")

// Turn each group of fields into a struct with a populated "net_type" field
val structColumns = columns.map(c => 
    when(col(c).isNotNull, 
        struct(
            lit(c) as "net_type",
            col(c) as "net",
            col(c + "_a") as "net_a",
            col(c + "_b") as "net_b"
        )
    )
)

// Put into one column the populated group for each row
val df2 = df.select(coalesce(structColumns:_*) as "net")

// Flatten back down to top level fields instead of being in a struct
val df3 = df2.selectExpr("net.*")

// Write df3 (which has the net_type column), partitioned by net_type
df3.write.partitionBy("net_type").parquet("/some/file/path.parquet")

This gives you rows like this:

scala> df3.show
+--------+---+-----+-----+
|net_type|net|net_a|net_b|
+--------+---+-----+-----+
|    net1|  1|    1|    1|
|    net1|  2|    2|    2|
|    net2|  3|    3|    3|
+--------+---+-----+-----+

And files in the filesystem like this:

/some/file/path.parquet/
    net_type=net1/
        part1.parquet
        ..
    net_type=net2/
        part1.parquet
        ..
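To load a single set (or several) back, it should be enough to filter on net_type when reading; because net_type is a partition column, Spark only scans the matching directories. A small sketch, assuming the same output path as above and that spark.implicits._ is in scope, as in the question's snippets:

// Only the net_type=net1/ directory is scanned, thanks to partition pruning
val net1Back = spark.read.parquet("/some/file/path.parquet")
  .filter($"net_type" === "net1")

// Load several groups at once
val someGroups = spark.read.parquet("/some/file/path.parquet")
  .filter($"net_type".isin("net1", "net2"))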