Question

I have the following DataFrame and schema in Spark:

val df = spark.read.options(Map("header"-> "true")).csv("path")

scala> df show()

+-------+-------+-----+
|   user|  topic| hits|
+-------+-------+-----+
|     om|  scala|  120|
| daniel|  spark|   80|
|3754978|  spark|    1|
+-------+-------+-----+

scala> df printSchema

root
 |-- user: string (nullable = true)
 |--  topic: string (nullable = true)
 |--  hits: string (nullable = true)

I want to change the column hits to integer.

I tried this:

scala>    df.createOrReplaceTempView("test")
    val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test")

scala> dfNew.printSchema

root
 |-- user: string (nullable = true)
 |--  topic: string (nullable = true)
 |--  hits: string (nullable = true)
 |-- hist2: integer (nullable = true)

But when I print the DataFrame, the column hist2 is filled with NULLs:

scala> dfNew show()

+-------+-------+-----+-----+
|   user|  topic| hits|hist2|
+-------+-------+-----+-----+
|     om|  scala|  120| null|
| daniel|  spark|   80| null|
|3754978|  spark|    1| null|
+-------+-------+-----+-----+

I also tried this:

scala> val df2 = df.withColumn("hitsTmp",
df.hits.cast(IntegerType)).drop("hits"
).withColumnRenamed("hitsTmp", "hits")

and got this:

<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame

I also tried this:

scala> val df2 = df.selectExpr ("user","topic","cast(hits as int) hits")

and got this:
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user,  topic,  hits]; line 1 pos 0;
'Project [user#0, 'topic, cast('hits as int) AS hits#22]
+- Relation[user#0, topic#1, hits#2] csv

With

 scala> val df2 = df.selectExpr ("cast(hits as int) hits")

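I got a similar error.

Any help would be appreciated. I know this question has been answered before, but I tried three different approaches (posted here) and none of them is working.

Thanks.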

Answer 1

You can cast the column to Integer type in any of the following ways:

df.withColumn("hits", df("hits").cast("integer"))

or

data.withColumn("hitsTmp",
      data("hits").cast(IntegerType)).drop("hits").
      withColumnRenamed("hitsTmp", "hits")

or

data.selectExpr ("user","topic","cast(hits as int) hits")
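
Note that both the schema and the AnalysisException in the question show the column names with a leading space (` topic`, ` hits`), so `hits` may not resolve until the names are cleaned. The following is a minimal sketch of that cleanup, not a definitive fix. It assumes a local SparkSession and a small in-memory stand-in for the CSV data; the object name `TrimColumnNames` and the helper `trimmedNames` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

object TrimColumnNames {
  // Hypothetical helper: strips surrounding whitespace from every column name.
  def trimmedNames(names: Seq[String]): Seq[String] = names.map(_.trim)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("trim-column-names")
      .getOrCreate()
    import spark.implicits._

    // Assumed stand-in for the CSV: note the leading spaces in " topic" and " hits".
    val raw = Seq(("om", "scala", "120"), ("daniel", "spark", "80"), ("3754978", "spark", "1"))
      .toDF("user", " topic", " hits")

    // Rename every column to its trimmed name, then trim the values and cast.
    val cleaned = raw.toDF(trimmedNames(raw.columns): _*)
    val typed = cleaned.withColumn("hits", trim(col("hits")).cast("int"))

    typed.printSchema()
    typed.show()
    spark.stop()
  }
}
```

Trimming the values before the cast is a precaution here, since a CSV with padded headers may well have padded values too.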

Answer 2

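How can we make Spark's cast throw an exception instead of producing all nulls? Do I have to count the total number of nulls before and after the cast to see whether the cast really succeeded?

The post "How to test datatype conversion during casting" does exactly that. I am wondering whether there is a better solution here.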
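
One way to sketch the null-counting idea is to count the rows whose source value is present but whose cast result is null. This is only an illustration under assumptions: the object name `CastCheck` and the helper `failedCasts` are hypothetical, and the session explicitly sets `spark.sql.ansi.enabled` to `false` so that a failed cast yields null. As far as I know, Spark 3.0 and later raise an error on an invalid cast when that setting is `true`, which may be the stricter behavior the question asks for.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CastCheck {
  // Hypothetical helper: counts rows where the source value is non-null
  // but casting it to targetType produces null.
  def failedCasts(df: DataFrame, column: String, targetType: String): Long =
    df.filter(col(column).isNotNull && col(column).cast(targetType).isNull).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-check")
      .config("spark.sql.ansi.enabled", "false") // failed casts become null
      .getOrCreate()
    import spark.implicits._

    val df = Seq("1", "2", "3A").toDF("n_str")
    val bad = failedCasts(df, "n_str", "int")
    println(s"rows that failed to cast: $bad")

    spark.stop()
  }
}
```

A caller could then decide to fail fast, for example with `require(bad == 0, ...)`, rather than continue with silently nulled values.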

Answer 3

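This response is late, but I ran into the same problem and got it working, so I am posting it here in case it helps someone. Try declaring the schema as a StructType. Reading from a CSV file and supplying a schema inferred from a case class gave me strange data-type errors, even though all the data formats were specified correctly.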
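
A minimal sketch of declaring the schema explicitly as a StructType, assuming the three columns from the question and a small temporary CSV file written just for the demonstration. The object name `ExplicitSchema` and the file contents are assumptions, not the answerer's actual code.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ExplicitSchema {
  // Schema declared up front instead of being inferred.
  val schema: StructType = StructType(Seq(
    StructField("user", StringType, nullable = true),
    StructField("topic", StringType, nullable = true),
    StructField("hits", IntegerType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-schema")
      .getOrCreate()

    // Assumed sample data, written to a temp file so the sketch is self-contained.
    val path = Files.createTempFile("hits", ".csv")
    Files.write(path, "user,topic,hits\nom,scala,120\ndaniel,spark,80\n".getBytes(StandardCharsets.UTF_8))

    val df = spark.read
      .option("header", "true")
      .schema(schema)
      .csv(path.toString)

    df.printSchema()
    df.show()
    spark.stop()
  }
}
```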

Answer 4

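I know this answer may not be useful to the OP, since it comes about two years late. However, it may help someone else who runs into this problem.

Just like you, I had a DataFrame with a column of strings that I was trying to cast to integer: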

> df.show
+-------+
|     id|
+-------+
|4918088|
|4918111|
|4918154|
   ...

> df.printSchema
root
 |-- id: string (nullable = true)

But after casting to IntegerType, all I got was a column of nulls:

> df.withColumn("test", $"id".cast(IntegerType))
    .select("id","test")
    .show
+-------+----+
|     id|test|
+-------+----+
|4918088|null|
|4918111|null|
|4918154|null|
      ...

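By default, if you try to cast a string that contains non-numeric characters to integer, the cast does not fail; instead, those values are set to null, as the following example shows: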

> val testDf = sc.parallelize(Seq(("1"), ("2"), ("3A") )).toDF("n_str")
> testDf.withColumn("n_int", $"n_str".cast(IntegerType))
        .select("n_str","n_int")
        .show
+-----+-----+
|n_str|n_int|
+-----+-----+
|    1|    1|
|    2|    2|
|   3A| null|
+-----+-----+

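The catch with the initial DataFrame is that, at first glance, the show method does not reveal any non-numeric characters. However, if you take a single row from the DataFrame, you see something different: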

> df.first
org.apache.spark.sql.Row = [4?9?1?8?0?8?8??]

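Why does this happen? You are probably reading a CSV file with an unsupported encoding.

You can solve this by changing the encoding of the file you are reading. If that is not an option, you can also clean each column before doing the type cast. An example: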

> val df_cast = df.withColumn("test", regexp_replace($"id", "[^0-9]","").cast(IntegerType))
                  .select("id","test")
> df_cast.show
+-------+-------+
|     id|   test|
+-------+-------+
|4918088|4918088|
|4918111|4918111|
|4918154|4918154|
       ...

> df_cast.printSchema
root
 |-- id: string (nullable = true)
 |-- test: integer (nullable = true)

Answer 5

Try removing the quotes around hist. If that does not work, try trimming the column:

dfNew = spark.sql("select *, cast(trim(hist) as integer) as hist2 from test")
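
For context: in the question's query, `'hist'` in single quotes is a string literal rather than a column reference, and the literal text cannot be parsed as an integer, so every row gets NULL. The sketch below shows the SQL form with a column reference instead. It assumes the column is really named ` hits` with a leading space, as the question's schema and error message suggest, which is why the name is backtick-quoted. The object name `SqlCastExample`, the helper `castHits` and the output column `hits2` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlCastExample {
  // Hypothetical helper: registers a stand-in table and casts ` hits` via SQL.
  def castHits(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Assumed stand-in for the CSV, including the leading spaces in the names.
    Seq(("om", "scala", "120"), ("daniel", "spark", "80"), ("3754978", "spark", "1"))
      .toDF("user", " topic", " hits")
      .createOrReplaceTempView("test")

    // Backticks make the name, leading space included, a column reference.
    spark.sql("select *, cast(trim(` hits`) as int) as hits2 from test")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sql-cast-example")
      .getOrCreate()

    val dfNew = castHits(spark)
    dfNew.printSchema()
    dfNew.show()
    spark.stop()
  }
}
```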

Answer 6

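I had a similar problem: I was casting strings to integer, but I realized that I needed to cast them to long instead. It was hard to realize this at first, because the type of my column was reported as integer when I printed the types with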

print(df.dtypes)
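
One way this situation can arise is a value larger than `Int.MaxValue` (2147483647): the column's declared type is integer, but the value itself does not fit. The following sketch compares the two casts side by side. It is written in Scala to match the rest of the thread, and the object name `IntVsLong`, the helper `compareCasts` and the sample values are assumptions. The session sets `spark.sql.ansi.enabled` to `false` so that an out-of-range cast yields null instead of an error.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object IntVsLong {
  // Hypothetical helper: casts the same string column to int and to long.
  def compareCasts(df: DataFrame, column: String): DataFrame =
    df.select(
      col(column),
      col(column).cast("int").as("as_int"),
      col(column).cast("long").as("as_long")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("int-vs-long")
      .config("spark.sql.ansi.enabled", "false") // overflow becomes null
      .getOrCreate()
    import spark.implicits._

    // "3000000000" does not fit in a 32-bit integer but fits in a long.
    val df = Seq("120", "3000000000").toDF("n_str")
    compareCasts(df, "n_str").show()

    spark.stop()
  }
}
```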

Spark SQL cast function creates column with NULLS
