Question

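My Spark program needs to read a file containing a matrix of integers. Columns are separated by ",". The number of columns differs each time I run the program.

I read the file as a DataFrame: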

var df = spark.read.csv(originalPath);

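But when I print the schema, every column is reported as a string.

I cast all columns to integers as shown below, but when I print the schema of df again afterwards, the columns are still strings.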

df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));

I would appreciate any help with solving the casting problem.

Thanks.

Answer 1

DataFrames are immutable. Your code creates a new DataFrame for each column and then discards it, so df itself never changes.
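To illustrate the point, here is a minimal sketch that can be pasted into spark-shell. It assumes a local SparkSession and made-up two-column data; the names `original` and `casted` are illustrative and do not come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("immutability-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the CSV contents.
val original = Seq(("1", "2")).toDF("_c0", "_c1")

// withColumn returns a new DataFrame; `original` itself is not modified.
val casted = original.withColumn("_c0", original.col("_c0").cast(IntegerType))

original.printSchema() // _c0 is still a string here
casted.printSchema()   // _c0 is an integer only in the returned DataFrame
```

The cast only becomes visible if the returned DataFrame is kept, which is what the variants below do.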

The best option is to use map and select:

val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)

But you can also use foldLeft:

df.columns.foldLeft(df)((acc, x) => acc.withColumn(x, acc.col(x).cast("integer")))

Or even (please don't) a mutable reference:

var df = Seq(("1", "2", "3")).toDF

df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))
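Because the number of columns changes between runs, the select-based version can be wrapped in a small helper. The following is only a sketch: `castAllToInt` is a hypothetical name, and the sample data and the local SparkSession are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Hypothetical helper: casts every column of `df` to IntegerType,
// whatever the number of columns happens to be. `.as(c)` keeps the
// original column name explicit.
def castAllToInt(df: DataFrame): DataFrame =
  df.select(df.columns.map(c => col(c).cast(IntegerType).as(c)): _*)

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the CSV contents.
val strings = Seq(("1", "2", "3"), ("4", "5", "6")).toDF("_c0", "_c1", "_c2")
val ints = castAllToInt(strings)

ints.printSchema() // every column is now reported as integer
```

With the DataFrame from the question this would be used as `val newDF = castAllToInt(df)`.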

Answer 2

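Alternatively, since the number of columns differs on each run as you mentioned, you can take the maximum possible number of columns and build a schema from it, with IntegerType as the type of every column. When the file is loaded, the reader applies this schema, so the DataFrame columns are read as integers rather than strings. No explicit cast is needed in this case.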

import org.apache.spark.sql.types._

val csvSchema = StructType(Array(
  StructField("_c0", IntegerType, true),
  StructField("_c1", IntegerType, true),
  StructField("_c2", IntegerType, true),
  StructField("_c3", IntegerType, true)))

val df = spark.read.schema(csvSchema).csv(originalPath)

scala> df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
 |-- _c3: integer (nullable = true)
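Writing out one StructField per column gets tedious when the column count varies, so the schema can also be generated. This is a sketch under assumptions: `intSchema` is a hypothetical helper, and the `_c0`, `_c1`, ... names simply mirror the default CSV column names shown above.

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical helper: builds a schema of `n` nullable integer columns
// named _c0, _c1, ..., _c(n-1).
def intSchema(n: Int): StructType =
  StructType((0 until n).map(i => StructField(s"_c$i", IntegerType, nullable = true)))

// Same four-column schema as the hand-written one, built programmatically.
val generatedSchema = intSchema(4)
println(generatedSchema.treeString)
```

It would then be passed to the reader as `spark.read.schema(intSchema(maxColumns)).csv(originalPath)`, where `maxColumns` is an assumed upper bound on the column count that you supply yourself.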
