I have a PySpark DataFrame with 4 columns.
Sample DataFrame:
id                  | name                        | age | job
---------------------------------------------------------------
["98475", "748574"] | ["98475", "748574"]         |     |
["75473", "98456"]  | ["98456"]                   |     |
["23456", "28596"]  | ["84758", "56849", "86954"] |     |
---------------------------------------------------------------
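I want to compare two columns (both of type array<string>).
Example:
Array_A (id) | Array_B (name)
------------------------------
If all values in Array_B match the values in Array_A ==> ok
If all values in Array_B are contained in Array_A ==> medium
If the values in Array_B are not present in Array_A ==> not found
I wrote a UDF: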
def contains(x,y):
    z = len(set(x) - set(y))
    if ((z == 0) & (set(x) == set(y))):
        return "ok"
    elif (set(y).isin(set(x))) & (z != 0):
        return "medium"
    else set(y) != set(x):
        return "not found in raw"

contains_udf = udf(contains)
Then:
new_df = df.withColumn(
    "new_column",
    F.when(
        (df.id.isNotNull() & df.name.isNotNull()),
        contains_udf(df.id, df.name)
    ).otherwise(
        F.lit(None)
    )
)
I get this error:
else set(y) != set(x):
^
SyntaxError: invalid syntax
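How can I solve this with a UDF or some other approach (for example array_contains)? Thanks.
Answer 0 (score: 1)
@Buckeye14Guy and @Sid have pointed out the main problems in your code; you may also want to clean up some of the logic: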
from pyspark.sql.functions import udf

def contains(x, y):
    try:
        sx, sy = set(x), set(y)
        if len(sy) == 0:
            return 'list is empty'
        elif sx == sy:
            return "ok"
        elif sy.issubset(sx):
            return "medium"
        # none of sy is in sx (this also covers an empty x)
        elif sx - sy == sx:
            return "none found in raw"
        else:
            return "some missing in raw"
    # e.g. `x` or `y` is None (not a list)
    except Exception:
        return "not an iterable or other errors"

udf_contains = udf(contains, 'string')

df.withColumn('new_column', udf_contains('id', 'name')).show(truncate=False)
+---------------+---------------------+-----------------+
|id |name |new_column |
+---------------+---------------------+-----------------+
|[98475, 748574]|[98475, 748574] |ok |
|[75473, 98456] |[98456] |medium |
|[23456, 28596] |[84758, 56849, 86954]|none found in raw|
+---------------+---------------------+-----------------+
Answer 1 (score: 0)
else set(y) != set(x):
^
SyntaxError: invalid syntax
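This is because an else statement does not take a condition. It holds the code that runs only when none of the preceding conditions is met. Use one of these instead:
elif set(y) != set(x):
    # code
or
else:
    # code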
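Fixing the else alone is likely not enough for the function in the question: Python sets have no isin() method, so the elif branch would raise an AttributeError once the syntax error is gone. Below is a minimal sketch of the question's function with both points addressed, using set.issubset() for the containment check. It is plain Python, so it can be tried without Spark before wrapping it in udf(); it assumes both arguments are lists and not None.

```python
def contains(x, y):
    """Minimal fix of the question's function (x: id list, y: name list)."""
    sx, sy = set(x), set(y)
    if sx == sy:
        return "ok"
    elif sy.issubset(sx):  # sets offer issubset(); there is no set.isin()
        return "medium"
    else:                  # a plain else takes no condition
        return "not found in raw"


print(contains(["98475", "748574"], ["98475", "748574"]))         # ok
print(contains(["75473", "98456"], ["98456"]))                    # medium
print(contains(["23456", "28596"], ["84758", "56849", "86954"]))  # not found in raw
```

Because the equal-sets case is handled first, the elif branch only ever sees a proper subset, so the extra z != 0 check from the question is no longer needed.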