I have a dataframe in PySpark:
df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
The schema is:
DataFrame[id: int, name: string, testing: string, avg_result: string, score: string, active: boolean]
I want to convert Y to True, N to False, true to True, and false to False.

When I apply it like below:
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').
                       when(f.col(col) == 'true', True).when(f.col(col) == 'false', False).otherwise(f.col(col)))
I get an error, and the dataframe does not change:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN (testing = N) THEN False WHEN (testing = Y) THEN True WHEN (testing = true) THEN true WHEN (testing = false) THEN false ELSE testing' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
When I apply it like below:
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').otherwise(f.col(col)))
I get the error below, but the dataframe does change:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| True| 0.05| 10| false|
| 3| Ian| False| 0.01| 1| false|
| 4| Jim| False| 1.2| 3| true|
+---+----+-------+----------+-----+------+
New attempt
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').
                       when(f.col(col) == 'true', 'True').when(f.col(col) == 'false', 'False').otherwise(f.col(col)))
I get the error below. How can I get the dataframe with Y/N and true/false converted as described?
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(true as double)))) null else CASE cast(cast(true as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(false as double)))) null else CASE cast(cast(false as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
Answer 0 (score: 3)
As I mentioned in the comments, the issue is a type mismatch. You need to cast the boolean column to a string before doing the comparison. Finally, you also need to cast the column to a string inside otherwise() as well (a column can't contain mixed types).

Your code is easy to modify to get the correct output:
import pyspark.sql.functions as f
cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(
        col,
        f.when(
            f.col(col) == 'N',
            'False'
        ).when(
            f.col(col) == 'Y',
            'True'
        ).when(
            f.col(col).cast('string') == 'true',
            'True'
        ).when(
            f.col(col).cast('string') == 'false',
            'False'
        ).otherwise(f.col(col).cast('string'))
    )
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
However, there are some alternatives as well. For example, this is a good place to use pyspark.sql.Column.isin():
from functools import reduce  # reduce is built in on Python 2 but must be imported on Python 3

df = reduce(
    lambda df, col: df.withColumn(
        col,
        f.when(
            f.col(col).cast('string').isin(['N', 'false']),
            'False'
        ).when(
            f.col(col).cast('string').isin(['Y', 'true']),
            'True'
        ).otherwise(f.col(col).cast('string'))
    ),
    cols,
    df
)
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
(Here I used reduce to eliminate the for loop, but you could keep it.)
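If you keep the loop instead, the same isin() logic might look like the sketch below (my own rewrite, not part of the original answer; it assumes the cols list and the functions import f defined above):

# Same mapping as the reduce version, written as a plain for loop
for col in cols:
    df = df.withColumn(
        col,
        f.when(f.col(col).cast('string').isin(['N', 'false']), 'False')
         .when(f.col(col).cast('string').isin(['Y', 'true']), 'True')
         .otherwise(f.col(col).cast('string'))
    )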
You can also use pyspark.sql.DataFrame.replace(), but you have to cast the active column to a string first:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true'], 'True', subset=cols)\
    .replace(['N', 'false'], 'False', subset=cols)
df.show()
# results omitted, but it's the same as above
Or use replace just once:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true', 'N', 'false'], ['True', 'True', 'False', 'False'], subset=cols)
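Not part of the original answer, but a quick sanity check afterwards might look like this sketch (output omitted):

df.printSchema()                                  # testing and active should both be string now
df.select('testing', 'active').distinct().show() # remaining values should only be 'True', 'False' and null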
Answer 1 (score: 1)
Looking at the schema and the transformations applied, there is a type mismatch between the String and the Boolean being returned. For example, 'N' is returned as 'False' (a string) while 'false' is returned as False (a boolean).
You can cast the transformed column to String in order to convert Y to True, N to False, true to True, and false to False.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as f
data = [
    (1, "sam", None, None, None, True),
    (2, "Ram", "Y", "0.05", "10", False),
    (3, "Ian", "N", "0.01", "1", False),
    (4, "Jim", "N", "1.2", "3", True)
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("testing", StringType(), True),
    StructField("avg_result", StringType(), True),
    StructField("score", StringType(), True),
    StructField("active", BooleanType(), True)
])
df = sc.parallelize(data).toDF(schema)
Before applying the transformation:

>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- testing: string (nullable = true)
|-- avg_result: string (nullable = true)
|-- score: string (nullable = true)
|-- active: boolean (nullable = true)
>>> df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
Applying the transformation, with the cast to String in the else clause:

cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(col,
                       f.when(f.col(col) == 'N', 'False')
                        .when(f.col(col) == 'Y', 'True')
                        .when(f.col(col).cast("string") == 'true', 'True')
                        .when(f.col(col).cast("string") == 'false', 'False')
                        .otherwise(f.col(col).cast("string")))

Result: testing and active now contain the strings True and False, the same output as shown in the first answer.
Answer 2 (score: 0)
You can cast them to boolean and then back to string.

Edit: I am using Spark 2.3.0.

For example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, initcap
from pyspark.sql.types import IntegerType, BooleanType, StringType, StructType, StructField
data = [(1, "Y"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"), (3, None)]
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
print(df.dtypes)
df.show()
df = df.withColumn("txt", col("txt").cast(BooleanType()))
print(df.dtypes)
df.show()
df = df.withColumn("txt", col("txt").cast(StringType()))
df = df.withColumn("txt", initcap(col("txt")))
print(df.dtypes)
df.show()
will give you:
[('id', 'int'), ('txt', 'string')]
+---+-----+
| id| txt|
+---+-----+
| 1| Y|
| 1| N|
| 2|false|
| 2| 1|
| 3| NULL|
| 3| null|
+---+-----+
[('id', 'int'), ('txt', 'boolean')]
+---+-----+
| id| txt|
+---+-----+
| 1| true|
| 1|false|
| 2|false|
| 2| true|
| 3| null|
| 3| null|
+---+-----+
[('id', 'int'), ('txt', 'string')]
+---+-----+
| id| txt|
+---+-----+
| 1| True|
| 1|False|
| 2|False|
| 2| True|
| 3| null|
| 3| null|
+---+-----+
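Applied to the dataframe from the question, the same cast-then-initcap idea might look like the sketch below. This is my own extrapolation, not part of the original answer: the cols list is assumed, df is the question's original dataframe, and it relies on 'Y'/'N' casting to true/false exactly as shown in the example above.

# Cast each column to boolean ('Y'/'true'/true -> true, 'N'/'false'/false -> false),
# then back to string and capitalize so the values read 'True' / 'False'.
cols = ["testing", "active"]
for c in cols:
    df = df.withColumn(c, initcap(col(c).cast(BooleanType()).cast(StringType())))
df.show()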