Pyspark: filter a DataFrame by the number of null values in each row

Time: 2017-04-10 09:46:13

Tags: sql apache-spark pyspark apache-spark-sql spark-dataframe

I am using pyspark and I have a table like this:

id |  ClientNum  | Value |      Date     | Age   |   Country  |   Job
 1 |      19     |   A   |   1483695000  |  21   |    null    |   null
 2 |      19     |   A   |   1483696500  |  21   |    France  |   null
 3 |      19     |   A   |   1483697800  |  21   |    France  |  Engineer
 4 |      19     |   B   |   1483699000  |  21   |    null    |   null
 5 |      19     |   B   |   1483699500  |  21   |    France  |   null
 6 |      19     |   B   |   1483699800  |  21   |    France  |  Engineer
 7 |      24     |   C   |   1483699200  |  null |    null    |   null
 8 |      24     |   D   |   1483699560  |  28   |    Spain   |   null
 9 |      24     |   D   |   1483699840  |  28   |    Spain   |  Student

For each ClientNum, I want to keep, for each distinct Value, the row that specifies the most information (Age, Country, Job).

The result should look like this:

   ClientNum  | Value |      Date     | Age   |   Country  |   Job
       19     |   A   |   1483697800  |  21   |    France  |  Engineer
       19     |   B   |   1483699800  |  21   |    France  |  Engineer
       24     |   C   |   1483699200  | null  |    null    |   null
       24     |   D   |   1483699840  |  28   |    Spain   |  Student

Thanks!

3 answers:

Answer 0 (score: 1)

Here is one approach: use a udf to count the number of non-null values in each row, then filter the data with a Window function:

Let's first define the udf. It takes an array column as its argument and returns the number of non-null values in that array.

from pyspark.sql.functions import array, udf

def nullcounter(arr):
  # count the non-null entries of the array
  res = [x for x in arr if x is not None]
  return len(res)

nullcounter_udf = udf(nullcounter)

We add this column to your data:

df = df.withColumn("counter", nullcounter_udf(array(df.columns)))

Now we can partition your data by ClientNum and Value, and keep the rows with the highest counter value:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window = Window.partitionBy(df['ClientNum'], df['Value']).orderBy(df['counter'].desc())

df.select('*', rank().over(window).alias('rank')) \
  .filter(col('rank') == 1) \
  .sort('Value') \
  .show() 
+---+---------+-----+----------+----+-------+--------+-------+----+
| id|ClientNum|Value|      Date| Age|Country|     Job|counter|rank|
+---+---------+-----+----------+----+-------+--------+-------+----+
|  3|       19|    A|1483697800|  21| France|Engineer|      8|   1|
|  6|       19|    B|1483699800|  21| France|Engineer|      8|   1|
|  7|       24|    C|1483699200|null|   null|    null|      5|   1|
|  9|       24|    D|1483699840|  28|  Spain| Student|      8|   1|
+---+---------+-----+----------+----+-------+--------+-------+----+

Data

df = sc.parallelize([(1, 19, "A", 1483695000, 21, None, None),
                     (2, 19, "A", 1483696500, 21, "France", None),
                     (3, 19, "A", 1483697800, 21, "France", "Engineer"),
                     (4, 19, "B", 1483699000, 21, None, None),
                     (5, 19, "B", 1483699500, 21, "France", None),
                     (6, 19, "B", 1483699800, 21, "France", "Engineer"),
                     (7, 24, "C", 1483699200, None, None, None),
                     (8, 24, "D", 1483699560, 28, "Spain", None),
                     (9, 24, "D", 1483699840, 28, "Spain", "Student")]) \
       .toDF(["id", "ClientNum", "Value", "Date", "Age", "Country", "Job"])
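As a side note, a similar per-row non-null count can also be built from built-in column expressions instead of a udf. This is only a minimal sketch, assuming the column names from the example data above; the resulting counter column can be plugged into the same window logic:

from functools import reduce
from pyspark.sql.functions import col, when, lit

# 1 for each non-null column value, 0 otherwise, summed across all columns
counter_expr = reduce(
    lambda a, b: a + b,
    [when(col(c).isNotNull(), lit(1)).otherwise(lit(0)) for c in df.columns]
)

df_with_counter = df.withColumn("counter", counter_expr)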

Answer 1 (score: 0)

Try this:

    // register your dataframe as a temp table
    Your_data_frame.registerTempTable("allData")

    // find the max Date for each (ClientNum, Value) and join back to the original table
    sqlContext.sql("""
      select a.ClientNum, a.Value, a.Date, a.Age, a.Country, a.Job
      from allData a
      join (select ClientNum, Value, max(Date) as max_date
            from allData
            group by ClientNum, Value) b
      on a.ClientNum = b.ClientNum and a.Value = b.Value and a.Date = b.max_date
    """).show
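Since the question is tagged pyspark, a rough DataFrame-API sketch of the same idea (join each (ClientNum, Value) group back to its latest Date) might look like this, assuming the original DataFrame is named df:

from pyspark.sql import functions as F

# latest Date per (ClientNum, Value)
max_dates = (df.groupBy("ClientNum", "Value")
               .agg(F.max("Date").alias("Date")))

# keep only the rows whose Date matches the group maximum
result = (df.join(max_dates, on=["ClientNum", "Value", "Date"], how="inner")
            .select("ClientNum", "Value", "Date", "Age", "Country", "Job"))
result.show()

Note that this keeps the row with the latest Date per group, which happens to coincide with the most complete row in the example data; it does not actually count null values.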

Answer 2 (score: 0)

If, like me, you had trouble with the other answers, here is my Python solution using a UDF (Spark 2.2.0):

Let's create a dummy dataset:

llist = [(1, 'alice', 'some_field', 'some_field', 'some_field', None), (30, 'bob', 'some_field', None, None, 10), (3, 'charles', 'some_field', None, 'some_other_field', 1111)]
df = sqlContext.createDataFrame(llist, ['id', 'name','field1','field2', 'field3', 'field4'])

df.show()

+---+-------+----------+----------+----------------+------+
| id|   name|    field1|    field2|          field3|field4|
+---+-------+----------+----------+----------------+------+
|  1|  alice|some_field|some_field|      some_field|  null|
| 30|    bob|some_field|      null|            null|    10|
|  3|charles|some_field|      null|some_other_field|  1111|
+---+-------+----------+----------+----------------+------+

Let's define our UDF to count the None values:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import struct, udf

count_empty_columns = udf(
                        lambda row: len([x for x in row if x is None]), 
                        IntegerType()
                      )

We can then add a new column null_count based on that UDF:

df = df.withColumn('null_count',
        count_empty_columns(struct([df[x] for x in df.columns])))

df.show()

+---+-------+----------+----------+----------------+------+----------+
| id|   name|    field1|    field2|          field3|field4|null_count|
+---+-------+----------+----------+----------------+------+----------+
|  1|  alice|some_field|some_field|      some_field|  null|         1|
| 30|    bob|some_field|      null|            null|    10|         2|
|  3|charles|some_field|      null|some_other_field|  1111|         1|
+---+-------+----------+----------+----------------+------+----------+

Finally, filter:

df = df.filter(df['null_count'] <= 1)
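This filters on an absolute threshold. To get back to the original question (keep, per ClientNum and Value, the row carrying the most information), the null_count column can be combined with a window. A minimal sketch, assuming a DataFrame that has the question's ClientNum and Value columns and the null_count column added as above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# per (ClientNum, Value), keep the row with the fewest null values
w = Window.partitionBy("ClientNum", "Value").orderBy(F.col("null_count").asc())

best_rows = (df.withColumn("rn", F.row_number().over(w))
               .filter(F.col("rn") == 1)
               .drop("rn", "null_count"))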