Question

I have a StructField in a dataframe that is not nullable. A simple example:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields

returns:

[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]

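Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing it, so I adapted the code from there to this: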

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)

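which failed with:

TypeError: StructField(name,StringType,true) is not JSON serializable

I also see this in the stack trace:

raise ValueError("Circular reference detected")

So I'm a bit stuck. Can anyone modify this example so that I can define a dataframe in which column foo is nullable?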

Answer 1

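It seems you missed the StructType(newSchema) wrapper: the list of fields has to be wrapped in a StructType before it is passed as the schema.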

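import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
# foo is declared with nullable=True here, since the goal is a nullable column
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),True)]
# wrap the list of fields in a StructType before passing it as the schema
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()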

Answer 2

I know this question has already been answered, but I was looking for a more generic solution and came up with this:

def set_df_columns_nullable(spark, df, column_list, nullable=True):
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod

You can then call it like this:

set_df_columns_nullable(spark,df,['name','age'])

Answer 3

df1 = df.rdd.toDF()
df1.printSchema()

Output:

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- foo: boolean (nullable = true)

Answer 4

In the general case, the nullability of a column can be changed through the nullable property of that column's StructField.

Can I change the nullability of a column in my Spark dataframe?