Question

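I know this is a very simple question that has most likely been answered somewhere already, but as a beginner I still don't get it and am looking for your help. Thank you in advance:

I have an interim DataFrame: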

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |

What I need is to remove all the redundant entries in the host column; in other words, I need to end up with the distinct result, like:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |

Answer 1

If df is the name of your DataFrame, there are two ways to get unique rows:

df2 = df.distinct()

or

df2 = df.drop_duplicates()

Answer 2

The plain distinct() is not that user-friendly, because you cannot specify which column to deduplicate on. In your case, this is enough:

df = df.distinct()

But if the day column contains other values, you will not get distinct hosts back:

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
|     uplherc.upl.com|  1|
+--------------------+---+

After distinct() you will get back the following:

df.distinct().show()

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
+--------------------+---+

So you should use this instead:

df = df.dropDuplicates(['host'])

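It keeps one row per host, with the first day value encountered for that host. Without an explicit ordering, which row counts as first is not guaranteed.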

If you are familiar with SQL, this will also work for you:

df.createOrReplaceTempView("temp_table")
new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")

+--------------------+-----------------+
|  first(host, false)|first(day, false)|
+--------------------+-----------------+
|   in24.inetnebr.com|                1|
|ix-esc-ca2-07.ix....|                1|
|     uplherc.upl.com|                1|
+--------------------+-----------------+

How do I get distinct rows in a DataFrame using PySpark?
