Question

I'm trying to compare the schemas of two DataFrames. Basically, the columns and types are the same, but the "nullable" flag can differ:

Dataframe A

StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,False),
StructField(ExternalIds,ArrayType(StructType(List(
    StructField(AppId,StringType,True),
    StructField(ExtId,StringType,False),
)),True),True),
....

Dataframe B

StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,True),
StructField(ExternalIds,ArrayType(StructType(List(
    StructField(AppId,StringType,True),
    StructField(ExtId,StringType,True),
)),True),True),
....

When I run df_A.schema == df_B.schema, the result is obviously False. But I want to ignore the "nullable" parameter: whether it is False or True, if the structure is the same, the comparison should return True.

Is this possible?

Answer 1

Take the following two DataFrame schemas as an example:

df_A.printSchema()
#root
# |-- ClientId: string (nullable = true)
# |-- PublicId: string (nullable = true)
# |-- PartyType: string (nullable = true)

df_B.printSchema()
#root
# |-- ClientId: string (nullable = true)
# |-- PublicId: string (nullable = true)
# |-- PartyType: string (nullable = false)

Assuming the fields are in the same order, you can access the name and dataType of each field in the schema and zip them together for comparison:

print(
    all(
        (a.name, a.dataType) == (b.name, b.dataType) 
        for a,b in zip(df_A.schema, df_B.schema)
    )
)
#True

If the fields are not in the same order, you can compare the sorted fields instead:

print(
    all(
        (a.name, a.dataType) == (b.name, b.dataType) 
        for a,b in zip(
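            # simpleString() gives an orderable tiebreaker; DataType objects themselves do not support "<"
            sorted(df_A.schema, key=lambda x: (x.name, x.dataType.simpleString())),
            sorted(df_B.schema, key=lambda x: (x.name, x.dataType.simpleString()))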
        )
    )
)
#True
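
One caveat about the comparisons above: dataType equality still takes into account the nullability flags nested inside array, map, and struct types, so a column like ExternalIds in the question would still compare as different. A minimal sketch of one way around this, assuming it is acceptable to compare the dictionaries returned by schema.jsonValue(); the helper names drop_nullability and same_schema_ignoring_nullable are made up for illustration:

```python
# Keys that carry nullability in the dict form of a schema (schema.jsonValue()).
NULLABILITY_KEYS = {"nullable", "containsNull", "valueContainsNull"}


def drop_nullability(node):
    """Return a copy of a schema dict with every nullability flag removed."""
    if isinstance(node, dict):
        return {
            # Field metadata is user data, so it is copied through untouched.
            key: (value if key == "metadata" else drop_nullability(value))
            for key, value in node.items()
            if key not in NULLABILITY_KEYS
        }
    if isinstance(node, list):
        return [drop_nullability(item) for item in node]
    return node  # plain type names such as "string"


def same_schema_ignoring_nullable(schema_a, schema_b):
    """Compare two schemas (objects exposing jsonValue()), ignoring nullability."""
    return drop_nullability(schema_a.jsonValue()) == drop_nullability(schema_b.jsonValue())
```

With the DataFrames from the question this would be called as same_schema_ignoring_nullable(df_A.schema, df_B.schema). Note that, unlike the sorted comparison, this sketch is sensitive to field order.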

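If the two DataFrames may have a different number of columns, you can first compare the schema lengths as a short-circuit check; if it fails, don't iterate over the fields at all (zip would otherwise stop silently at the end of the shorter schema):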

print(len(df_A.schema) == len(df_B.schema))
#True
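
The three steps above (length check, sorting, pairwise comparison) can be folded into one helper. This is only a sketch: the function name same_top_level_fields is made up for illustration, and it ignores nullable on top-level fields only.

```python
def same_top_level_fields(schema_a, schema_b):
    """True if both schemas have the same (name, dataType) pairs, in any order.

    Only the top-level `nullable` flag is ignored here.
    """
    fields_a, fields_b = list(schema_a), list(schema_b)
    # Short-circuit: zip() would silently truncate to the shorter schema.
    if len(fields_a) != len(fields_b):
        return False
    # Sort by name so that column order does not matter.
    fields_a.sort(key=lambda f: f.name)
    fields_b.sort(key=lambda f: f.name)
    return all(
        (a.name, a.dataType) == (b.name, b.dataType)
        for a, b in zip(fields_a, fields_b)
    )
```

Usage would be print(same_top_level_fields(df_A.schema, df_B.schema)), relying on the fact that iterating over a schema yields its fields, as in the snippets above.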
