I have a larger dataframe with 100+ columns, where sets of columns share the same names and are distinguished only by a unique number. Multiple smaller dataframes need to be created based on that unique number.
Yes, the column names follow the same pattern, and the number of such groups can sometimes be 64 and sometimes 128: net1, net2, net3 ... net64 ... net128
I need to end up with 64 sub-dfs or 128 sub-dfs. I can't use startswith, because column names like net1, net10, net11 ... net100, net101 ... would all match.
I have already created a solution in Spark + Scala and it works, but I feel there must be a simpler way to do this dynamically.
df_net.printSchema()
|-- net1: string (nullable = true)
|-- net1_a: integer (nullable = true)
|-- net1_b: integer (nullable = true)
|-- net1_c: integer (nullable = true)
|-- net1_d: integer (nullable = true)
|-- net1_e: integer (nullable = true)
|-- net2: string (nullable = true)
|-- net2_a: integer (nullable = true)
|-- net2_b: integer (nullable = true)
|-- net2_c: integer (nullable = true)
|-- net2_d: integer (nullable = true)
|-- net2_e: integer (nullable = true)
|-- net3: string (nullable = true)
|-- net3_a: integer (nullable = true)
|-- net3_b: integer (nullable = true)
|-- net3_c: integer (nullable = true)
|-- net3_d: integer (nullable = true)
|-- net3_e: integer (nullable = true)
|-- net4: string (nullable = true)
|-- net4_a: integer (nullable = true)
|-- net4_b: integer (nullable = true)
|-- net4_c: integer (nullable = true)
|-- net4_d: integer (nullable = true)
|-- net4_e: integer (nullable = true)
|-- net5: string (nullable = true)
|-- net5_a: integer (nullable = true)
|-- net5_b: integer (nullable = true)
|-- net5_c: integer (nullable = true)
|-- net5_d: integer (nullable = true)
|-- net5_e: integer (nullable = true)
val df_net1 = df_net
.filter(!($"net1".isNull))
.select("net1","net1_a","net1_b","net1_c","net1_d","net1_e")
val df_net2 = df_net
.filter(!($"net2".isNull))
.select("net2","net2_a","net2_b","net2_c","net2_d","net2_e")
val df_net3 = df_net
.filter(!($"net3".isNull))
.select("net3","net3_a","net3_b","net3_c","net3_d","net3_e")
Smaller dataframes filtered on the unique number.
Answer 0 (score: 1)
Assuming your DF breaks down predictably into groups of 6 columns, the snippet below will produce an iterator of sub-DataFrames:
scala> df.printSchema
root
|-- net1: string (nullable = false)
|-- net1_a: integer (nullable = false)
|-- net1_b: integer (nullable = false)
|-- net1_c: integer (nullable = false)
|-- net1_d: integer (nullable = false)
|-- net1_e: integer (nullable = false)
|-- net2: string (nullable = false)
|-- net2_a: integer (nullable = false)
|-- net2_b: integer (nullable = false)
|-- net2_c: integer (nullable = false)
|-- net2_d: integer (nullable = false)
|-- net2_e: integer (nullable = false)
|-- net3: string (nullable = false)
|-- net3_a: integer (nullable = false)
|-- net3_b: integer (nullable = false)
|-- net3_c: integer (nullable = false)
|-- net3_d: integer (nullable = false)
|-- net3_e: integer (nullable = false)
|-- net4: string (nullable = false)
|-- net4_a: integer (nullable = false)
|-- net4_b: integer (nullable = false)
|-- net4_c: integer (nullable = false)
|-- net4_d: integer (nullable = false)
|-- net4_e: integer (nullable = false)
|-- net5: string (nullable = false)
|-- net5_a: integer (nullable = false)
|-- net5_b: integer (nullable = false)
|-- net5_c: integer (nullable = false)
|-- net5_d: integer (nullable = false)
|-- net5_e: integer (nullable = false)
scala> val sub_dfs = df.schema.map(_.name).grouped(6).map{fields => df.select(fields.map(col): _*).where(col(fields.head).isNotNull)}
sub_dfs: Iterator[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = non-empty iterator
scala> sub_dfs.foreach{_.printSchema}
root
|-- net1: string (nullable = false)
|-- net1_a: integer (nullable = false)
|-- net1_b: integer (nullable = false)
|-- net1_c: integer (nullable = false)
|-- net1_d: integer (nullable = false)
|-- net1_e: integer (nullable = false)
root
|-- net2: string (nullable = false)
|-- net2_a: integer (nullable = false)
|-- net2_b: integer (nullable = false)
|-- net2_c: integer (nullable = false)
|-- net2_d: integer (nullable = false)
|-- net2_e: integer (nullable = false)
root
|-- net3: string (nullable = false)
|-- net3_a: integer (nullable = false)
|-- net3_b: integer (nullable = false)
|-- net3_c: integer (nullable = false)
|-- net3_d: integer (nullable = false)
|-- net3_e: integer (nullable = false)
root
|-- net4: string (nullable = false)
|-- net4_a: integer (nullable = false)
|-- net4_b: integer (nullable = false)
|-- net4_c: integer (nullable = false)
|-- net4_d: integer (nullable = false)
|-- net4_e: integer (nullable = false)
root
|-- net5: string (nullable = false)
|-- net5_a: integer (nullable = false)
|-- net5_b: integer (nullable = false)
|-- net5_c: integer (nullable = false)
|-- net5_d: integer (nullable = false)
|-- net5_e: integer (nullable = false)
Each element of the iterator contains 6 columns from the parent dataset.
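Since grouped returns an iterator that can only be traversed once, it can be convenient to materialize the sub-DataFrames into a map keyed by each group's leading column. A minimal sketch, assuming the same fixed grouping of 6 columns as above (the name subDfsByName is illustrative):

// Sketch: materialize the groups into a Map so each sub-DataFrame can be
// looked up by its leading column name ("net1", "net2", ...).
// col comes from org.apache.spark.sql.functions (available in spark-shell).
val subDfsByName = df.schema.map(_.name).grouped(6).map { fields =>
  fields.head -> df.select(fields.map(col): _*).where(col(fields.head).isNotNull)
}.toMap

// subDfsByName("net3").show()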
Answer 1 (score: 1)
The columns in your dataframe seem to follow a pattern, since they all start with a common string. If that doesn't change, you can use something like the following.
val df_net1 = df.select(df.columns.filter(a => a.startsWith("net1")).map(a => df(a)): _*)
val df_net2 = df.select(df.columns.filter(a => a.startsWith("net2")).map(a => df(a)): _*)
val df_net3 = df.select(df.columns.filter(a => a.startsWith("net3")).map(a => df(a)): _*)
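Note that startsWith("net1") will also match net10, net100, and so on once the groups go past net9, which is exactly the ambiguity mentioned in the question. A minimal sketch that sidesteps this by matching the group number exactly, assuming the netN / netN_suffix naming above (the subDf helper is illustrative, not part of the original answer):

import org.apache.spark.sql.DataFrame

// Illustrative helper: select only the columns belonging to group n,
// i.e. "netN" itself and "netN_<suffix>", so that "net1" does not
// accidentally pull in "net10", "net100", etc.
def subDf(df: DataFrame, n: Int): DataFrame = {
  val cols = df.columns.filter(c => c == s"net$n" || c.startsWith(s"net${n}_"))
  df.select(cols.map(df(_)): _*)
}

// val df_net1 = subDf(df, 1)   // picks net1, net1_a, ... but not net10_*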
Answer 2 (score: 1)
Assuming you have a common prefix name across the columns, this solution will work for a variable number of columns with the same prefix.
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}

object FilterDataframes extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  import spark.implicits._

  val df = spark
    .sparkContext.parallelize(Seq(new MyNets())).toDF
  df.show

  case class MyNets(
    net1: Int = 1,
    net1_a: Int = 2,
    net1_b: Int = 3,
    net1_c: Int = 4,
    net1_d: Int = 4,
    net1_e: Int = 5,
    net2: Int = 6,
    net2_a: Int = 7,
    net2_b: Int = 8,
    net2_c: Int = 9,
    net2_d: Int = 10,
    net2_e: Int = 11,
    net3: Int = 12,
    net3_a: Int = 13,
    net3_b: Int = 14,
    net3_c: Int = 15,
    net3_d: Int = 16,
    net4_e: Int = 17,
    net5: Int = 18,
    net5_a: Int = 19,
    net5_b: Int = 20,
    net5_c: Int = 21,
    net5_d: Int = 22,
    net5_e: Int = 23
  )

  val net1: DataFrame = df.select(df.columns.filter(_.startsWith("net1")).map(df(_)): _*)
  val net2: DataFrame = df.select(df.columns.filter(_.startsWith("net2")).map(df(_)): _*)
  val net3: DataFrame = df.select(df.columns.filter(_.startsWith("net3")).map(df(_)): _*)
  val net4: DataFrame = df.select(df.columns.filter(_.startsWith("net4")).map(df(_)): _*)
  val net5: DataFrame = df.select(df.columns.filter(_.startsWith("net5")).map(df(_)): _*)

  net1.show
  net2.show
  net3.show
  net4.show
  net5.show
}
Result:
+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+
|net1|net1_a|net1_b|net1_c|net1_d|net1_e|net2|net2_a|net2_b|net2_c|net2_d|net2_e|net3|net3_a|net3_b|net3_c|net3_d|net4_e|net5|net5_a|net5_b|net5_c|net5_d|net5_e|
+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+
|   1|     2|     3|     4|     4|     5|   6|     7|     8|     9|    10|    11|  12|    13|    14|    15|    16|    17|  18|    19|    20|    21|    22|    23|
+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+----+------+------+------+------+------+

+----+------+------+------+------+------+
|net1|net1_a|net1_b|net1_c|net1_d|net1_e|
+----+------+------+------+------+------+
|   1|     2|     3|     4|     4|     5|
+----+------+------+------+------+------+

+----+------+------+------+------+------+
|net2|net2_a|net2_b|net2_c|net2_d|net2_e|
+----+------+------+------+------+------+
|   6|     7|     8|     9|    10|    11|
+----+------+------+------+------+------+

+----+------+------+------+------+
|net3|net3_a|net3_b|net3_c|net3_d|
+----+------+------+------+------+
|  12|    13|    14|    15|    16|
+----+------+------+------+------+

+------+
|net4_e|
+------+
|    17|
+------+

+----+------+------+------+------+------+
|net5|net5_a|net5_b|net5_c|net5_d|net5_e|
+----+------+------+------+------+------+
|  18|    19|    20|    21|    22|    23|
+----+------+------+------+------+------+
Now you can apply null checks on the resulting dataframes.
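For example, a sketch (not part of the original answer) that builds the sub-dataframes in a loop and applies the non-null check on each group's leading column, assuming the net1..net5 groups from the toy case class above:

// Sketch: one Map of sub-DataFrames instead of five separate vals.
val subDfs: Map[String, DataFrame] = (1 to 5).map { i =>
  val prefix = s"net$i"
  val cols   = df.columns.filter(c => c == prefix || c.startsWith(prefix + "_"))
  val sub    = df.select(cols.map(df(_)): _*)
  // Only filter on the leading column if it is actually present
  // (in the toy case class above the net4 group only has net4_e).
  prefix -> (if (df.columns.contains(prefix)) sub.where(df(prefix).isNotNull) else sub)
}.toMap

// subDfs("net2").show()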
Answer 3 (score: 0)
I would collapse the different groups of net fields into a single set of columns, with a net_type field identifying which group each row belongs to. Then you can do a partitioned write, which lets you easily load a single set, or as many sets as you need.
This gets you a number of benefits: filtering on net_type automatically determines which values get loaded for you. Here is the code to do it:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDS (already in scope in spark-shell)

case class Net(net1: Integer,
               net1_a: Integer,
               net1_b: Integer,
               net2: Integer,
               net2_a: Integer,
               net2_b: Integer)

val df = Seq(
  Net(1, 1, 1, null, null, null),
  Net(2, 2, 2, null, null, null),
  Net(null, null, null, 3, 3, 3)
).toDS

// You could find these automatically if you wanted
val columns = Seq("net1", "net2")

// Turn each group of fields into a struct with a populated "net_type" field
val structColumns = columns.map(c =>
  when(col(c).isNotNull,
    struct(
      lit(c) as "net_type",
      col(c) as "net",
      col(c + "_a") as "net_a",
      col(c + "_b") as "net_b"
    )
  )
)

// Put into one column the populated group for each row
val df2 = df.select(coalesce(structColumns: _*) as "net")

// Flatten back down to top level fields instead of being in a struct
val df3 = df2.selectExpr("net.*")

// Write df3 (the flattened frame that has the net_type column), partitioned by net_type
df3.write.partitionBy("net_type").parquet("/some/file/path.parquet")
This gives you rows like this:
scala> df3.show
+--------+---+-----+-----+
|net_type|net|net_a|net_b|
+--------+---+-----+-----+
| net1| 1| 1| 1|
| net1| 2| 2| 2|
| net2| 3| 3| 3|
+--------+---+-----+-----+
And files in the file system like this:
/some/file/path.parquet/
net_type=net1/
part1.parquet
..
net_type=net2/
part1.parquet
..
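A sketch of reading a single set back, assuming the same path as the write above; because net_type is a partition column, only the matching net_type directory is scanned:

// Sketch: load only the "net1" set; partition pruning means only the
// net_type=net1 directory under the parquet path is read.
// col comes from org.apache.spark.sql.functions (imported above).
val net1Df = spark.read.parquet("/some/file/path.parquet")
  .where(col("net_type") === "net1")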