I have a dataframe in PySpark:
df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
The schema is:
DataFrame[id: int, name: string, testing: string, avg_result: string, score: string, active: boolean]
I want to convert Y to True, N to False, true to True, and false to False.

When I apply it like below:
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').
                       when(f.col(col) == 'true', True).when(f.col(col) == 'false', False).otherwise(f.col(col)))
I get an error, and the dataframe does not change:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN (testing = N) THEN False WHEN (testing = Y) THEN True WHEN (testing = true) THEN true WHEN (testing = false) THEN false ELSE testing' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
When I apply it like below:
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').otherwise(f.col(col)))
I get the error below, but the dataframe does change:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| True| 0.05| 10| false|
| 3| Ian| False| 0.01| 1| false|
| 4| Jim| False| 1.2| 3| true|
+---+----+-------+----------+-----+------+
New attempt
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').
                       when(f.col(col) == 'true', 'True').when(f.col(col) == 'false', 'False').otherwise(f.col(col)))
I get the error below. How can I get the dataframe with Y/N and true/false converted as described?
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(true as double)))) null else CASE cast(cast(true as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(false as double)))) null else CASE cast(cast(false as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
Answer 0 (score: 3)
As I mentioned in the comments, the issue is a type mismatch. You need to cast the boolean column to a string before doing the comparison. Finally, you also need to cast the column to a string inside otherwise() as well (a column can't contain mixed types).

Your code is easy to modify to get the correct output:
import pyspark.sql.functions as f
cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(
        col,
        f.when(
            f.col(col) == 'N',
            'False'
        ).when(
            f.col(col) == 'Y',
            'True'
        ).when(
            f.col(col).cast('string') == 'true',
            'True'
        ).when(
            f.col(col).cast('string') == 'false',
            'False'
        ).otherwise(f.col(col).cast('string'))
    )
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
However, there are some alternatives as well. For example, this is a good place to use pyspark.sql.Column.isin():
from functools import reduce  # reduce is built in on Python 2 but must be imported on Python 3

df = reduce(
    lambda df, col: df.withColumn(
        col,
        f.when(
            f.col(col).cast('string').isin(['N', 'false']),
            'False'
        ).when(
            f.col(col).cast('string').isin(['Y', 'true']),
            'True'
        ).otherwise(f.col(col).cast('string'))
    ),
    cols,
    df
)
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
(Here I used reduce to eliminate the for loop, but you could keep it.)
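If you keep the loop instead, the same isin() logic might look like the sketch below (my own rewrite, not part of the original answer; it assumes the cols list and the functions import f defined above):

# Same mapping as the reduce version, written as a plain for loop
for col in cols:
    df = df.withColumn(
        col,
        f.when(f.col(col).cast('string').isin(['N', 'false']), 'False')
         .when(f.col(col).cast('string').isin(['Y', 'true']), 'True')
         .otherwise(f.col(col).cast('string'))
    )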
You can also use pyspark.sql.DataFrame.replace(), but you have to cast the active column to a string first:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true'], 'True', subset=cols)\
    .replace(['N', 'false'], 'False', subset=cols)
df.show()
# results omitted, but it's the same as above
Or use replace just once:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true', 'N', 'false'], ['True', 'True', 'False', 'False'], subset=cols)
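Not part of the original answer, but a quick sanity check afterwards might look like this sketch (output omitted):

df.printSchema()                                  # testing and active should both be string now
df.select('testing', 'active').distinct().show() # remaining values should only be 'True', 'False' and null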
Answer 1 (score: 1)
Looking at the schema and the transformations applied, there is a type mismatch between the String and the Boolean being returned. For example, 'N' is returned as 'False' (a string) while 'false' is returned as False (a boolean).
You can cast the transformed column to String in order to convert Y to True, N to False, true to True, and false to False.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as f
data = [
    (1, "sam", None, None, None, True),
    (2, "Ram", "Y", "0.05", "10", False),
    (3, "Ian", "N", "0.01", "1", False),
    (4, "Jim", "N", "1.2", "3", True)
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("testing", StringType(), True),
    StructField("avg_result", StringType(), True),
    StructField("score", StringType(), True),
    StructField("active", BooleanType(), True)
])
df = sc.parallelize(data).toDF(schema)
Before applying the transformation:

>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- testing: string (nullable = true)
|-- avg_result: string (nullable = true)
|-- score: string (nullable = true)
|-- active: boolean (nullable = true)
>>> df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
Applying the transformation, with the cast to String in the else clause:

cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(col,
                       f.when(f.col(col) == 'N', 'False')
                        .when(f.col(col) == 'Y', 'True')
                        .when(f.col(col).cast("string") == 'true', 'True')
                        .when(f.col(col).cast("string") == 'false', 'False')
                        .otherwise(f.col(col).cast("string")))

Result: testing and active now contain the strings True and False, the same output as shown in the first answer.
Answer 2 (score: 0)
You can cast them to boolean and then back to string.

Edit: I am using Spark 2.3.0.

For example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, initcap
from pyspark.sql.types import IntegerType, BooleanType, StringType, StructType, StructField
data = [(1, "Y"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"), (3, None)]
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
print(df.dtypes)
df.show()
df = df.withColumn("txt", col("txt").cast(BooleanType()))
print(df.dtypes)
df.show()
df = df.withColumn("txt", col("txt").cast(StringType()))
df = df.withColumn("txt", initcap(col("txt")))
print(df.dtypes)
df.show()
will give you:
[('id', 'int'), ('txt', 'string')]
+---+-----+
| id| txt|
+---+-----+
| 1| Y|
| 1| N|
| 2|false|
| 2| 1|
| 3| NULL|
| 3| null|
+---+-----+
[('id', 'int'), ('txt', 'boolean')]
+---+-----+
| id| txt|
+---+-----+
| 1| true|
| 1|false|
| 2|false|
| 2| true|
| 3| null|
| 3| null|
+---+-----+
[('id', 'int'), ('txt', 'string')]
+---+-----+
| id| txt|
+---+-----+
| 1| True|
| 1|False|
| 2|False|
| 2| True|
| 3| null|
| 3| null|
+---+-----+
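Applied to the dataframe from the question, the same cast-then-initcap idea might look like the sketch below. This is my own extrapolation, not part of the original answer: the cols list is assumed, df is the question's original dataframe, and it relies on 'Y'/'N' casting to true/false exactly as shown in the example above.

# Cast each column to boolean ('Y'/'true'/true -> true, 'N'/'false'/false -> false),
# then back to string and capitalize so the values read 'True' / 'False'.
cols = ["testing", "active"]
for c in cols:
    df = df.withColumn(c, initcap(col(c).cast(BooleanType()).cast(StringType())))
df.show()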