Suppose I have the following case:
from pyspark.sql.types import *

schema = StructType([  # schema
    StructField("id", StringType(), True),
    StructField("ev", ArrayType(StringType()), True),
    StructField("ev2", ArrayType(StringType()), True),
])

df = spark.createDataFrame([{"id": "se1", "ev": ["ev11", "ev12"], "ev2": ["ev11"]},
                            {"id": "se2", "ev": ["ev11"], "ev2": ["ev11", "ev12"]},
                            {"id": "se3", "ev": ["ev21"], "ev2": ["ev11", "ev12"]},
                            {"id": "se4", "ev": ["ev21", "ev22"], "ev2": ["ev21", "ev22"]}],
                           schema=schema)
This gives me:
df.show()
+---+------------+------------+
| id| ev| ev2|
+---+------------+------------+
|se1|[ev11, ev12]| [ev11]|
|se2| [ev11]|[ev11, ev12]|
|se3| [ev21]|[ev11, ev12]|
|se4|[ev21, ev22]|[ev21, ev22]|
+---+------------+------------+
I want to create a new boolean column for the rows where the contents of the "ev" column are contained within the "ev2" column (or just select the rows where that is true), returning:
df_target.show()
+---+------------+------------+
| id| ev| ev2|
+---+------------+------------+
|se2| [ev11]|[ev11, ev12]|
|se4|[ev21, ev22]|[ev21, ev22]|
+---+------------+------------+
Or:
df_target.show()
+---+------------+------------+-------+
| id| ev| ev2|evInEv2|
+---+------------+------------+-------+
|se1|[ev11, ev12]| [ev11]| false|
|se2| [ev11]|[ev11, ev12]| true|
|se3| [ev21]|[ev11, ev12]| false|
|se4|[ev21, ev22]|[ev21, ev22]| true|
+---+------------+------------+-------+
I tried using the isin method:
df.withColumn('evInEv2', df['ev'].isin(df['ev2'])).show()
+---+------------+------------+-------+
| id| ev| ev2|evInEv2|
+---+------------+------------+-------+
|se1|[ev11, ev12]| [ev11]| false|
|se2| [ev11]|[ev11, ev12]| false|
|se3| [ev21]|[ev11, ev12]| false|
|se4|[ev21, ev22]|[ev21, ev22]| true|
+---+------------+------------+-------+
But it looks like it only checks whether it is the same array.
I also tried the array_contains function from pyspark.sql.functions, but it only accepts a single object rather than an array to check against.
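For reference, a small sketch of what array_contains does cover, it only tests membership of one given value, so it cannot check whether one array column is contained in another (the has_ev11 column name here is just for illustration):

from pyspark.sql.functions import array_contains

# array_contains checks whether a single value is present in the array column
df.withColumn("has_ev11", array_contains(df.ev2, "ev11")).show()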
I even had trouble searching for this, since I struggled to phrase the question correctly.
Thanks!
Answer 0 (score: 3)
Here is one option using a udf, where we check the length of the difference between the columns ev and ev2. When the length of the resulting set is 0, that is, all elements of ev are contained in ev2, we return True; otherwise False.
from pyspark.sql.functions import udf

def contains(x, y):
    # number of elements of x that are not also in y
    z = len(set(x) - set(y))
    if z == 0:
        return True
    else:
        return False

contains_udf = udf(contains)

df.withColumn("evInEv2", contains_udf(df.ev, df.ev2)).show()
+---+------------+------------+-------+
| id| ev| ev2|evInEv2|
+---+------------+------------+-------+
|se1|[ev11, ev12]| [ev11]| false|
|se2| [ev11]|[ev11, ev12]| true|
|se3| [ev21]|[ev11, ev12]| false|
|se4|[ev21, ev22]|[ev21, ev22]| true|
+---+------------+------------+-------+
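A minimal sketch for the question's other requested output (keeping only the rows where the check is true), assuming the contains function defined above and declaring an explicit BooleanType return type (udf defaults to StringType):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# same logic as above, but returning a real boolean column
contains_bool_udf = udf(contains, BooleanType())

# keep only rows where every element of ev is also in ev2
df.filter(contains_bool_udf(df.ev, df.ev2)).show()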
Answer 1 (score: 1)
Alternatively, you can use:
from pyspark.sql.functions import udf

subsetOf = udf(lambda A, B: set(A).issubset(set(B)))
df.withColumn("evInEv2", subsetOf(df.ev, df.ev2)).show()
Answer 2 (score: 0)
Another way to do this for Spark >= 2.4.0, avoiding a UDF and using the built-in array_except:
from pyspark.sql.functions import size, array_except, lit

def is_subset(a, b):
    # true when a has no elements that are missing from b
    return lit(size(array_except(a, b)) == 0)

df.withColumn("is_subset", is_subset(df.ev, df.ev2)).show()
Output:
+---+------------+------------+---------+
| id| ev| ev2|is_subset|
+---+------------+------------+---------+
|se1|[ev11, ev12]| [ev11]| false|
|se2| [ev11]|[ev11, ev12]| true|
|se3| [ev21]|[ev11, ev12]| false|
|se4|[ev21, ev22]|[ev21, ev22]| true|
+---+------------+------------+---------+
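And to get just the matching rows (the first target output in the question), one possible sketch using the same built-ins:

from pyspark.sql.functions import size, array_except

# keep only rows where ev has no elements outside ev2
df.filter(size(array_except(df.ev, df.ev2)) == 0).show()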
Answer 3 (score: 0)
I created a Spark UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

antecedent_inside_predictions = udf(lambda antecedent, prediction: all(elem in prediction for elem in antecedent), BooleanType())
and then used it in a join as follows:
fp_predictions = filtered_rules.join(personal_item_recos, antecedent_inside_predictions("antecedent", "item_predictions"))
Note that I needed to enable cross joins:
spark.conf.set('spark.sql.crossJoin.enabled', True)
(Finally, I extract the specific item I need from the consequent as follows:
fp_predictions = fp_predictions.withColumn("ITEM_SK", fp_predictions.consequent.getItem(0))
)
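As a rough, self-contained illustration of this join-based approach, here is a sketch with hypothetical toy dataframes (the rules/recos names and their data are made up for the example):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# hypothetical toy data, just to illustrate the join condition
rules = spark.createDataFrame([("r1", ["ev11"]), ("r2", ["ev21", "ev99"])],
                              ["rule_id", "antecedent"])
recos = spark.createDataFrame([("u1", ["ev11", "ev12"]), ("u2", ["ev21", "ev22"])],
                              ["user_id", "item_predictions"])

# True when every element of the antecedent appears in the predictions
is_subset = udf(lambda a, b: all(e in b for e in a), BooleanType())

# the condition has no equi-join keys, so Spark needs cross joins enabled
spark.conf.set('spark.sql.crossJoin.enabled', True)
rules.join(recos, is_subset(rules.antecedent, recos.item_predictions)).show()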