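The code snippet below attempts to do the following:
For each customer_code in sdf1, check whether that customer code also appears in sdf2. If it does, replace sdf1.actual_related_customer with sdf2.actual_related_customer.
This code does not work because I am accessing the rows of sdf2 incorrectly. How can I achieve the above? (If you have a suggestion other than indexing, fire away!)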
sdf1 = sqlCtx.createDataFrame(
    [
        ('customer1', 'customer_code1', 'other'),
        ('customer2', 'customer_code2', 'other'),
        ('customer3', 'customer_code3', 'other'),
        ('customer4', 'customer_code4', 'other')
    ],
    ('actual_related_customer', 'customer_code', 'other')
)
sdf2 = sqlCtx.createDataFrame(
    [
        ('Peter', 'customer_code1'),
        ('Deran', 'customer_code5'),
        ('Christopher', 'customer_code3'),
        ('Nick', 'customer_code4')
    ],
    ('actual_related_customer', 'customer_code')
)
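from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Attempted lookup: scan sdf2 for a matching customer_code.
# This is the part that fails.
def right_customer(x, y):
    for row in sdf2.collect():
        if x == row['customer_code']:
            return row['actual_related_customer']
    return y

fun1 = udf(right_customer, StringType())
test = sdf1.withColumn(
    "actual_related_customer",
    fun1(sdf1.customer_code, sdf1.actual_related_customer)
)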
The output I want is as follows:
desired_output = sqlCtx.createDataFrame(
    [
        ('Peter', 'customer_code1', 'other'),
        ('customer2', 'customer_code2', 'other'),
        ('Christopher', 'customer_code3', 'other'),
        ('Nick', 'customer_code4', 'other')
    ],
    ('actual_related_customer', 'customer_code', 'other')
)
Answer 0 (score: 0)
Let's go through it step by step.
First, rename actual_related_customer to actual_1 in sdf1, and rename actual_related_customer to actual_2 in sdf2:
sdf1=sdf1.withColumnRenamed('actual_related_customer', 'actual_1')
sdf2=sdf2.withColumnRenamed('actual_related_customer', 'actual_2')
Then join them with a left join on customer_code:
sdf1= sdf1.join(sdf2, on='customer_code', how='left')
sdf1.show()
Output:
+--------------+---------+-----+-----------+
| customer_code| actual_1|other|   actual_2|
+--------------+---------+-----+-----------+
|customer_code4|customer4|other|       Nick|
|customer_code2|customer2|other|       null|
|customer_code3|customer3|other|Christopher|
|customer_code1|customer1|other|      Peter|
+--------------+---------+-----+-----------+
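To see why customer_code2 ends up with null in actual_2, here is a rough plain-Python model of what the left join does, with no Spark involved. The names `left`, `right` and `joined` are made up for this illustration, and the sketch assumes each customer_code appears at most once in sdf2 (a real join would emit one row per match).

```python
# Plain-Python model of a left join on customer_code (illustration only).
# Rows of sdf1 after the rename: (actual_1, customer_code, other)
left = [
    ("customer1", "customer_code1", "other"),
    ("customer2", "customer_code2", "other"),
    ("customer3", "customer_code3", "other"),
    ("customer4", "customer_code4", "other"),
]

# sdf2 after the rename, keyed by customer_code -> actual_2
right = {
    "customer_code1": "Peter",
    "customer_code5": "Deran",
    "customer_code3": "Christopher",
    "customer_code4": "Nick",
}

# Every left row is kept; a missing key yields None (Spark shows it as null).
joined = [
    (code, actual_1, other, right.get(code))
    for actual_1, code, other in left
]

for row in joined:
    print(row)
```

customer_code5 exists only on the right side, so it never shows up in the result; customer_code2 exists only on the left side, so it is kept with an empty actual_2.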
Now add the replacement logic to sdf1: use actual_2 where the join found a match, otherwise keep actual_1:
from pyspark.sql import functions as F
sdf1 = sdf1.withColumn('actual_related_customer', F.when(sdf1.actual_2.isNotNull(), sdf1.actual_2).otherwise(sdf1.actual_1))
Finally, select and display the columns you want:
sdf1.select('customer_code', 'other', 'actual_related_customer').show()
Output:
+--------------+-----+-----------------------+
| customer_code|other|actual_related_customer|
+--------------+-----+-----------------------+
|customer_code4|other|                   Nick|
|customer_code2|other|              customer2|
|customer_code3|other|            Christopher|
|customer_code1|other|                  Peter|
+--------------+-----+-----------------------+