Question

Say I have a DataFrame named "orderitems" with the following schema:

    DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity: int, order_item_subtotal: float, order_item_product_price: float]

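As part of a data quality check, I need to make sure every row satisfies the formula order_item_subtotal = (order_item_quantity * order_item_product_price). To do that, I need to add a separate column "valid" whose value is "Y" for every row that satisfies the formula and "N" for every other row. I decided to use when() and otherwise() together with the withColumn() method, as shown below.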

    orderitems.withColumn("valid",when(orderitems.order_item_subtotal != (orderitems.order_item_product_price * orderitems.order_item_quantity),'N').otherwise("Y"))

But it returns the error below:

    TypeError: 'Column' object is not callable

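I understand this happens because I am trying to multiply two Column objects. But I am not sure how to fix it, since I am still learning. How can I resolve this? I am using Spark 2.3.0 with Python.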

Answer 1

Try something like this:

    from pyspark.sql.functions import col, when

    orderitems.withColumn("valid",
              when(col("order_item_subtotal") != (col("order_item_product_price") * col("order_item_quantity")), "N")
              .otherwise("Y")).show()
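Since order_item_subtotal and order_item_product_price are floats, an exact `!=` comparison may flag rows that differ only by rounding. The sketch below shows one possible variant that compares within a tolerance instead. The helper name `add_valid_column` and the default tolerance of 0.01 are assumptions for illustration, not part of the original answer.

```python
import math

# Pure-Python illustration: exact equality on floats is fragile.
exact_match = (3 * 0.1 == 0.3)                            # rounding error breaks equality
close_match = math.isclose(3 * 0.1, 0.3, abs_tol=1e-6)    # tolerance absorbs it


def add_valid_column(df, tol=0.01):
    """Hypothetical helper: flag rows with 'N' when the subtotal differs
    from quantity * price by more than `tol` (an assumed tolerance)."""
    # Imported here so the definitions above load without a Spark install.
    from pyspark.sql.functions import abs as sql_abs, col, when

    expected = col("order_item_product_price") * col("order_item_quantity")
    mismatch = sql_abs(col("order_item_subtotal") - expected) > tol
    return df.withColumn("valid", when(mismatch, "N").otherwise("Y"))
```

It would be called as `add_valid_column(orderitems).show()`. Note that if any of the three columns is null, the condition evaluates to null and the row falls through to `otherwise("Y")`, just as it does with the `!=` version.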

Answer 2

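This can be done with a Spark UDF, which works well for row-level operations. Before running this code, make sure the values you are comparing have the same data type.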

select contacts.title 
from contacts left join exclude on contacts.title like '*' & exclude.string & '*'
where exclude.string is null
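A UDF-based version of the check described in this answer could look something like the sketch below. The function names `validity_flag` and `add_valid_column_udf`, the 0.01 tolerance, and the choice to mark rows with missing values as "N" are all assumptions made for illustration.

```python
def validity_flag(subtotal, quantity, price, tol=0.01):
    """Return "Y" when subtotal matches quantity * price within `tol`, else "N"."""
    if subtotal is None or quantity is None or price is None:
        return "N"  # assumption: rows with missing values count as invalid
    return "Y" if abs(subtotal - quantity * price) <= tol else "N"


def add_valid_column_udf(df):
    """Hypothetical helper: apply validity_flag to each row through a Spark UDF."""
    # Imported here so validity_flag can be tested without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    flag = udf(validity_flag, StringType())
    return df.withColumn(
        "valid",
        flag(col("order_item_subtotal"),
             col("order_item_quantity"),
             col("order_item_product_price")),
    )
```

It would be called as `add_valid_column_udf(orderitems).show()`. Keeping the row logic in a plain Python function makes it easy to unit test, but Python UDFs generally run slower than built-in column expressions such as `when`, because each row has to be handed to a Python worker.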

How to multiply two columns in a Spark DataFrame
