Question

I have a CSV input file. We read it using the following:

val rawdata = spark.
  read.
  format("csv").
  option("header", true).
  option("inferSchema", true).
  load(filename)

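This reads the data in neatly and builds the schema.

The next step is to split the columns into String and Integer columns. How?

If the following is the schema of my dataset...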

scala> rawdata.printSchema
root
 |-- ID: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- DailyRate: integer (nullable = true)
 |-- Dept: string (nullable = true)
 |-- DistanceFromHome: integer (nullable = true)

I would like to split it into two variables (StringCols, IntCols), where:

StringCols should have "First Name", "Last Name", "Dept"
IntCols should have "ID", "Age", "DailyRate", "DistanceFromHome"

This is what I have tried:

val names = rawdata.schema.fieldNames
val types = rawdata.schema.fields.map(r => r.dataType)

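Now, in types, I would like to loop over the entries, find every StringType, and look up the matching column name in names; and likewise for IntegerType.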

Answer 1

You can filter the columns by type using the underlying schema and each field's dataType:

import org.apache.spark.sql.types.{IntegerType, StringType}

val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name)
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name)

val dfOfString = df.select(stringCols.head, stringCols.tail : _*)
val dfOfInt = df.select(intCols.head, intCols.tail : _*)
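
One caveat with the head/tail form above: it throws if a list is empty (for example, a file with no string columns). A minimal sketch of a variant that avoids this by passing Column objects to select; the local SparkSession and the one-row sample DataFrame are stand-ins I made up for illustration, not the asker's data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-cols-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the DataFrame loaded from the CSV file.
val df = Seq(
  (1, "Jane", "Doe", 43, 2000, "Sales", 0)
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome")

// Column names per data type, kept in schema order.
val stringCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
val intCols    = df.schema.fields.filter(_.dataType == IntegerType).map(_.name)

// select(cols: Column*) accepts an empty argument list, so no head/tail split is needed.
val dfOfString = df.select(stringCols.map(col): _*)
val dfOfInt    = df.select(intCols.map(col): _*)
```

Note that col treats a dot in a name as a nested-field accessor, so column names containing dots would need backticks; the names in this question do not.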

Answer 2

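Use the dtypes operator:

dtypes: Array[(String, String)] Returns all column names and their data types as an array.

That gives you a more idiomatic way of dealing with a Dataset's schema.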

val rawdata = Seq(
  (1, "First Name", "Last Name", 43, 2000, "Dept", 0)
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome")
scala> rawdata.dtypes.foreach(println)
(ID,IntegerType)
(First Name,StringType)
(Last Name,StringType)
(Age,IntegerType)
(DailyRate,IntegerType)
(Dept,StringType)
(DistanceFromHome,IntegerType)

I would like to split it into two variables (StringCols, IntCols)

(I'd rather stick to immutable values, if you don't mind)

val emptyPair = (Seq.empty[String], Seq.empty[String])
val (stringCols, intCols) = rawdata.dtypes.foldLeft(emptyPair) { case ((strings, ints), (name, typ)) =>
  typ match {
    case "StringType"  => (name +: strings, ints) // prepend the name to the string columns
    case "IntegerType" => (strings, name +: ints) // prepend the name to the integer columns
    case _             => (strings, ints)         // skip other types instead of throwing a MatchError
  }
}

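StringCols should have "First Name", "Last Name", "Dept" and IntCols should have "ID", "Age", "DailyRate", "DistanceFromHome"

Because each name is prepended, the two collections come out in reverse schema order. You could reverse them, but I'd rather avoid that: it costs extra work and gives you nothing in return.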
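
If schema order does matter, one alternative (a sketch of mine, not part of the answer above) is to run collect with a pattern over the dtypes pairs, which keeps the original order without a reverse step. The dtypes array below is hard-coded to mirror the rawdata.dtypes output shown earlier, so the snippet runs without Spark:

```scala
// Stand-in for rawdata.dtypes: (column name, type name) pairs in schema order.
val dtypes: Array[(String, String)] = Array(
  ("ID", "IntegerType"),
  ("First Name", "StringType"),
  ("Last Name", "StringType"),
  ("Age", "IntegerType"),
  ("DailyRate", "IntegerType"),
  ("Dept", "StringType"),
  ("DistanceFromHome", "IntegerType")
)

// collect keeps only the pairs matching the pattern and preserves their order.
val stringCols = dtypes.collect { case (name, "StringType") => name }
val intCols    = dtypes.collect { case (name, "IntegerType") => name }

println(stringCols.mkString(", ")) // First Name, Last Name, Dept
println(intCols.mkString(", "))    // ID, Age, DailyRate, DistanceFromHome
```

This walks the array twice rather than once, which should not matter for a schema-sized collection.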

How to split columns into two collections by type?
