Question

I have a PySpark DataFrame with 4 columns.

Sample DataFrame:

id                  | name                      | age | job
-----------------------------------------------------------
["98475","748574"]  | ["98475","748574"]        |     |
["75473","98456"]   | ["98456"]                 |     |
["23456","28596"]   | ["84758","56849","86954"] |     |

I want to compare two columns (both of type array<string>):

Example:

Array_A (id)  | Array_B(name)
------------------------------

If all the values in Array_B match the values in Array_A ==> ok

If all the values in Array_B are in Array_A ==> medium

If the values of Array_B do not exist in Array_A ==> not found

I wrote a UDF:

def contains(x,y):
        z = len(set(x) - set(y))
        if ((z == 0) & (set(x) == set(y))):
            return "ok"
        elif (set(y).isin(set(x))) & (z != 0):
            return "medium"
        else set(y) != set(x):
            return "not found in raw"


contains_udf = udf(contains)

Then:

new_df= df.withColumn(
    "new_column",
    F.when(
        (df.id.isNotNull() & df.name.isNotNull()),
        contains_udf(df.id,df.name)
    ).otherwise(
        F.lit(None)
    )

)

I get this error:

else set(y) != set(x):
           ^
SyntaxError: invalid syntax

How can I solve this with a UDF or some other approach (for example array_contains)? Thanks.

Answer 1

@Buckeye14Guy and @Sid pointed out the main problem in your code; you may also need to clean up some of the logic:

from pyspark.sql.functions import udf

def contains(x, y):
    try:
        sx, sy = set(x), set(y)
        if len(sy) == 0:
            return "list is empty"
        elif sx == sy:
            return "ok"
        elif sy.issubset(sx):
            return "medium"
        # below, none of sy is in sx
        elif sx - sy == sx:
            return "none found in raw"  # including empty x
        else:
            return "some missing in raw"
    # on an exception, for example when `x` or `y` is None (not a list)
    except Exception:
        return "not an iterable or other errors"

udf_contains = udf(contains, 'string')

df.withColumn('new_column', udf_contains('id', 'name')).show(truncate=False)
+---------------+---------------------+-----------------+
|id             |name                 |new_column       |
+---------------+---------------------+-----------------+
|[98475, 748574]|[98475, 748574]      |ok               |
|[75473, 98456] |[98456]              |medium           |
|[23456, 28596] |[84758, 56849, 86954]|none found in raw|
+---------------+---------------------+-----------------+

Answer 2

else set(y) != set(x):
           ^
SyntaxError: invalid syntax

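This is because an else statement does not take a condition. It contains the code that runs only when none of the preceding conditions are met. Use this instead: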

elif set(y) != set(x):
    #code

OR

else :
    #code
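
As a rough sketch of that fix applied to the function from the question: the version below uses a plain `else`, and it also swaps `set(y).isin(...)` for `issubset`, because Python sets have no `isin` method, and drops the bitwise `&` in favor of ordinary branching. It is plain Python with no Spark dependency, so it can be checked on its own before being wrapped with `udf`.

```python
def contains(x, y):
    # x is the `id` array and y is the `name` array, as in the question.
    sx, sy = set(x), set(y)
    if sx == sy:
        # Both arrays hold exactly the same values.
        return "ok"
    elif sy.issubset(sx):
        # Every value of y is in x, but x has extra values.
        return "medium"
    else:
        # No condition after `else`: it covers everything left over.
        return "not found in raw"


print(contains(["98475", "748574"], ["98475", "748574"]))
print(contains(["75473", "98456"], ["98456"]))
print(contains(["23456", "28596"], ["84758", "56849", "86954"]))
```

With the sample rows, the three calls fall into the `ok`, `medium`, and `not found in raw` branches respectively.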

How to check whether one array contains another array
