PySpark: Fill a column based on the last occurrence of a value in another column

Time: 2019-05-13 14:34:19

Tags: python apache-spark pyspark

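I am using PySpark to populate a Status column based on the values in the Code column. The df is ordered by the ID column.

The only valid Code values are A (Good), B (Bad), C (Neutral).

When one of these values appears, I want every following row to carry the same Status value until another valid Code value appears.

This is the ideal df output with the newly added Status column: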

+----+------+---------+
| ID | Code | Status  |
+----+------+---------+
|  1 | A    | Good    |
|  2 | 1x4  | Good    |
|  3 | B    | Bad     |
|  4 | ytyt | Bad     |
|  5 | zix8 | Bad     |
|  6 | C    | Neutral |
|  7 | 44d  | Neutral |
|  8 | A    | Good    |
+----+------+---------+
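To pin down the intended behaviour, here is a plain-Python sketch of the fill rule (not Spark, and only a reference for the semantics). The names `STATUS_BY_CODE` and `forward_fill_status` are made up for this illustration, and it assumes rows that come before the first valid code should get no status at all.

```python
# Reference semantics only: plain Python, not Spark.
# STATUS_BY_CODE and forward_fill_status are illustrative names.
STATUS_BY_CODE = {"A": "Good", "B": "Bad", "C": "Neutral"}


def forward_fill_status(codes):
    """Return one status per code, carrying the last valid status forward."""
    statuses = []
    current = None  # stays None until the first valid code is seen
    for code in codes:
        # A valid code switches the status; any other code keeps the old one.
        current = STATUS_BY_CODE.get(code, current)
        statuses.append(current)
    return statuses


codes = ["A", "1x4", "B", "ytyt", "zix8", "C", "44d", "A"]
print(forward_fill_status(codes))
```

Any Spark solution should reproduce this list when the rows are read in ID order.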

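I'm not sure how to approach this. I found this question, but I don't know whether its answer can be adapted to my needs: PySpark When item in list

I considered using the lag function, but the number of rows between the A, B and C rows is irregular, so I don't know how to apply it.

Here is a reproducible df: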

df = sqlCtx.createDataFrame(
    [
        (1, "A"),
        (2, "1x4"),
        (3, "B"),
        (4, "ytyt"),
        (5, "zix8"),
        (6, "C"),
        (7, "44d"),
        (8, "A")
    ],
    ('ID', 'Code')
)

2 Answers:

Answer 0: (score: 2)

First, fill in the Status for the valid Code values using the following function:

from pyspark.sql.functions import col, lit, when

def getStatus(code):
    return when(code=="A", lit("Good"))\
        .when(code=="B", lit("Bad"))\
        .when(code=="C", lit("Neutral"))

df = df.withColumn("Status", getStatus(col("Code")))
df.show()
#+---+----+-------+
#| ID|Code| Status|
#+---+----+-------+
#|  1|   A|   Good|
#|  2| 1x4|   null|
#|  3|   B|    Bad|
#|  4|ytyt|   null|
#|  5|zix8|   null|
#|  6|   C|Neutral|
#|  7| 44d|   null|
#|  8|   A|   Good|
#+---+----+-------+

Next, use a Window function to select the last non-null value of "Status", ordered by "ID". We can use pyspark.sql.functions.last with ignorenulls=True to select the last value.

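Answer 1: (score: 1)

Use when together with a running sum to define groups (each group runs from one occurrence of an "A", "B" or "C" code up to the next such occurrence, in ID order). Then use first over each of these groups to get the status column.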