Filter rows by distinct values in one column in PySpark

Time: 2016-09-02 08:27:41

Tags: apache-spark dataframe pyspark apache-spark-sql spark-dataframe

Let's say I have the following table:

+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|    tia1.eskimo.com |/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm|   404|           0|1995-08-01 01:04:...|
|      ras38.srv.net |/elv/DELTA/uncons...|   404|           0|1995-08-01 01:05:...|
| cs1-06.leh.ptd.net |                    |   404|           0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...|   404|           0|1995-08-01 01:33:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:35:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...|   404|           0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...|   404|           0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...|   404|           0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+

How can I filter this table in PySpark so that it contains only distinct paths? The resulting table should still contain all columns.

2 answers:

Answer 0: (score: 15)

If you want to keep only rows whose values in a specific column are distinct, call the dropDuplicates method on the DataFrame. In my example it looks like this:

dataFrame = ...  # your existing DataFrame
deduplicated = dataFrame.dropDuplicates(['path'])  # returns a new DataFrame; dataFrame itself is unchanged

where path is the column name.

Answer 1: (score: 0)

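To control which records are kept and which are dropped, you can use something like the following, provided the condition can be expressed in a Window expression. This is in Scala (more or less), but I imagine you can do the same in PySpark.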

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window.partitionBy('columns, 'to, 'make, 'unique).orderBy('conditionToPutRowToKeepFirst)

dataframe.withColumn("row_number", row_number().over(window)).where('row_number === 1).drop('row_number)