Question

I have a DataFrame with two columns:

+--------+-----+
|    col1| col2|
+--------+-----+
|22      | 12.2|
|1       |  2.1|
|5       | 52.1|
|2       | 62.9|
|77      | 33.3|
+--------+-----+

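I want to create a new DataFrame that contains only the rows where the value of col1 is greater than the value of col2.

Note that col1 is of type long and col2 is of type double.

The result should look like this: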

+--------+----+
|    col1|col2|
+--------+----+
|22      |12.2|
|77      |33.3|
+--------+----+

Answer 1

I think the best way is to simply use filter.

df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()

+--------+----+
|    col1|col2|
+--------+----+
|22      |12.2|
|77      |33.3|
+--------+----+

Answer 2

Another possible approach is to use the DataFrame's where function.

For example:

val output = df.where("col1>col2")

This gives you the expected result:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+

Answer 3

You can use sqlContext to make this simpler.

First register the DataFrame as a temporary view, for example:

df.createOrReplaceTempView("tbl1")

Then run SQL against it like this:

sqlContext.sql("select * from tbl1 where col1 > col2")

Answer 4

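The best way to keep rows based on a condition is to use filter, as others have mentioned.

To answer the question as stated in the title, one option for deleting rows based on a condition is to use a left_anti join in PySpark. For example, to delete all rows where col1 > col2: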

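# Rows matching the condition are the ones to remove.
rows_to_delete = df.filter(df.col1 > df.col2)

# key_column is a placeholder for the name of a column that uniquely identifies each row.
df_with_rows_deleted = df.join(rows_to_delete, on=[key_column], how='left_anti')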