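I am using PySpark to populate a Status column based on the values in the Code column. The df is sorted by the ID column.
The only valid Code values are A (Good), B (Bad), and C (Neutral).
When one of these values appears, I want every subsequent row to carry the same Status value until another valid Code value appears.
This is the desired df output with the newly added Status column: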
+----+------+---------+
| ID | Code | Status  |
+----+------+---------+
|  1 | A    | Good    |
|  2 | 1x4  | Good    |
|  3 | B    | Bad     |
|  4 | ytyt | Bad     |
|  5 | zix8 | Bad     |
|  6 | C    | Neutral |
|  7 | 44d  | Neutral |
|  8 | A    | Good    |
+----+------+---------+
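I am not sure how to approach this. I found this question, but I don't know whether its answer can be adapted to my needs: PySpark When item in list
I considered using the lag function, but the number of rows between the A, B, and C rows is irregular, so I don't know how to go about it.
Here is a reproducible df: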
df = sqlCtx.createDataFrame(
    [
        (1, "A"),
        (2, "1x4"),
        (3, "B"),
        (4, "ytyt"),
        (5, "zix8"),
        (6, "C"),
        (7, "44d"),
        (8, "A")
    ],
    ('ID', 'Code')
)
Answer 0 (score: 2)
First, fill in the Status for the valid Code values using the following function:
from pyspark.sql.functions import col, lit, when

def getStatus(code):
    # Map each valid code to its status; any other code yields null.
    return when(code == "A", lit("Good"))\
        .when(code == "B", lit("Bad"))\
        .when(code == "C", lit("Neutral"))

df = df.withColumn("Status", getStatus(col("Code")))
df.show()
#+---+----+-------+
#| ID|Code| Status|
#+---+----+-------+
#|  1|   A|   Good|
#|  2| 1x4|   null|
#|  3|   B|    Bad|
#|  4|ytyt|   null|
#|  5|zix8|   null|
#|  6|   C|Neutral|
#|  7| 44d|   null|
#|  8|   A|   Good|
#+---+----+-------+
Next, use a Window function to select the last non-null value of "Status", ordered by "ID". We can use pyspark.sql.functions.last with ignorenulls=True to select that value.
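Answer 1 (score: 1)
Use when with a running sum to define groups (each group runs from one occurrence of an "A", "B", or "C" code up to the row before the next occurrence, in ID order). Then use the first Code value of each group together with when to derive the Status column.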