Question

I get an error when filtering a DataFrame with an "or" condition. Here is the code:

df.select("InvoiceNo","Description").where((col("InvoiceNo") !== 536365) || (col("UnitPrice") > 600))

I also tried using "or", but got the same error.

df.select("InvoiceNo","Description").where((col("InvoiceNo") !== 536365).or(col("UnitPrice") > 600))

Error:

 org.apache.spark.sql.AnalysisException: cannot resolve 'UnitPrice' given input columns: [InvoiceNo, Description]

Where might I be going wrong? Please help.

Answer 1

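In relational algebra terms, a select in Spark SQL is a projection: it narrows the DataFrame down to the columns you selected.

As a result, you cannot then apply a selection (where, filter) to a column that was not part of the projection.

The logic differs slightly from regular SQL, so in your case you need to filter first and then select:

val df2 = df
 .where((col("InvoiceNo") !== 536365).or(col("UnitPrice") > 600)) // selection (σ)
 .select("InvoiceNo","Description") // projection (π)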
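For reference, the filter-then-select ordering can be sketched as a small self-contained program. This is only an illustration under assumptions: the sample rows, the object name FilterThenSelect, and the local SparkSession settings are made up, and InvoiceNo is assumed to be an integer column. It uses =!= rather than !==, since !== has been deprecated since Spark 2.0.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilterThenSelect {
  // Filter while UnitPrice is still available, then project to two columns.
  def filterThenSelect(df: DataFrame): DataFrame =
    df
      .where((col("InvoiceNo") =!= 536365) || (col("UnitPrice") > 600))
      .select("InvoiceNo", "Description")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-then-select")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; the real data set is not shown in the question.
    val df = Seq(
      (536365, "cheap item", 2.55),      // dropped: invoice matches, price <= 600
      (536366, "other invoice", 1.85),   // kept: different invoice
      (536365, "expensive item", 650.0)  // kept: price > 600
    ).toDF("InvoiceNo", "Description", "UnitPrice")

    filterThenSelect(df).show()
    spark.stop()
  }
}
```

The result contains only InvoiceNo and Description, even though the condition referred to UnitPrice.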

Answer 2

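You selected only two columns, InvoiceNo and Description, and the code then tries to filter on UnitPrice, which is not among the selected columns.

You can try the following: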

df.select("InvoiceNo","Description","UnitPrice").where((col("InvoiceNo") !== 536365).or(col("UnitPrice") > 600))

If you need to select specific columns, apply where first and select afterwards.

df.where((col("InvoiceNo") !== 536365).or(col("UnitPrice") > 600)).select("InvoiceNo","Description","UnitPrice")
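If UnitPrice is needed only for the condition and should not appear in the output, one possible variation on the first snippet in this answer is to select it, filter, and then drop it. The sketch below assumes hypothetical sample data, a made-up object name FilterThenDrop, and an integer InvoiceNo column. It uses =!= in place of the deprecated !==.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilterThenDrop {
  // Keep UnitPrice through the filter, then remove it from the result.
  def filterThenDrop(df: DataFrame): DataFrame =
    df
      .select("InvoiceNo", "Description", "UnitPrice")
      .where((col("InvoiceNo") =!= 536365).or(col("UnitPrice") > 600))
      .drop("UnitPrice")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-then-drop")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows with an extra column to show that it is projected away.
    val df = Seq(
      (536365, "cheap item", 2.55, 6),
      (536366, "other invoice", 1.85, 3)
    ).toDF("InvoiceNo", "Description", "UnitPrice", "Quantity")

    filterThenDrop(df).printSchema()
    spark.stop()
  }
}
```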