I have this DataFrame:
+----------+-------+----+-----------+
| Date|Count | GR|Count_NEW |
+----------+-------+----+-----------+
|2012-01-02| 25| 100| 25|
|2012-01-02| 250| 110| 250|
|2012-01-03| 26| 100| 26|
|2012-01-03| 251| 110| 251|
|2012-01-04| 24| 100| 24|
|2012-01-04| 242| 110| 242|
|2012-01-05| 26| 100| 26|
|2012-01-05| 254| 110| 254|
|2012-01-06| 0| 100| 0|
|2012-01-06| 254| 110| 254|
|2012-01-07| 25| 100| 25|
|2012-01-07| 256| 110| 256|
|2012-01-08| 28| 100| 28|
|2012-01-08| 0| 110| 0|
|2012-01-09| 22| 100| 22|
|2012-01-09| 289| 110| 289|
|2012-01-10| 29| 100| 29|
|2012-01-10| 276| 110| 276|
|2012-01-11| 21| 100| 21|
|2012-01-11| 259| 110| 259|
+----------+-------+----+-----------+
You can create the DataFrame with this:
l = [
('100', '2012-01-02', 25),
('110', '2012-01-02', 250),
('100', '2012-01-03', 26),
('110', '2012-01-03', 251),
('100', '2012-01-04', 24),
('110', '2012-01-04', 242),
('100', '2012-01-05', 26),
('110', '2012-01-05', 254),
('100', '2012-01-06', 0),
('110', '2012-01-06', 254),
('100', '2012-01-07', 25),
('110', '2012-01-07', 256),
('100', '2012-01-08', 28),
('110', '2012-01-08', 0),
('100', '2012-01-09', 22),
('110', '2012-01-09', 289),
('100', '2012-01-10', 29),
('110', '2012-01-10', 276),
('100', '2012-01-11', 21),
('110', '2012-01-11', 259),
('100', '2012-01-12', 32),
('110', '2012-01-12', 280),
('100', '2012-01-13', 39),
('110', '2012-01-13', 290)
]
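# Assumes `sparkc` is an existing SparkContext and `sqlContext` an existing SQLContext.
from pyspark.sql import Row
rdd = sparkc.parallelize(l)
member = rdd.map(lambda x: Row(GR=x[0], Date=x[1], Count=int(x[2])))
pdf = sqlContext.createDataFrame(member)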
Count and Count_NEW are the same column (so please ignore Count_NEW).
I want to fill in 26 for GR = 100 on 2012-01-06, and 256 for GR = 110 on 2012-01-08.
So it should look like this...
+----------+-------+----+-----------+
| Date|Count | GR|Count_NEW |
+----------+-------+----+-----------+
|2012-01-02| 25| 100| 25|
|2012-01-02| 250| 110| 250|
|2012-01-03| 26| 100| 26|
|2012-01-03| 251| 110| 251|
|2012-01-04| 24| 100| 24|
|2012-01-04| 242| 110| 242|
|2012-01-05| 26| 100| 26|
|2012-01-05| 254| 110| 254|
|2012-01-06| 26| 100| 0|
|2012-01-06| 254| 110| 254|
|2012-01-07| 25| 100| 25|
|2012-01-07| 256| 110| 256|
|2012-01-08| 28| 100| 28|
|2012-01-08| 256| 110| 0|
|2012-01-09| 22| 100| 22|
|2012-01-09| 289| 110| 289|
|2012-01-10| 29| 100| 29|
|2012-01-10| 276| 110| 276|
|2012-01-11| 21| 100| 21|
|2012-01-11| 259| 110| 259|
+----------+-------+----+-----------+
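In other words, I want to fill each zero with the previous non-zero value. How can I do this with a window function?
I tried this, but it doesn't work...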
win = Window.partitionBy("GR").orderBy("Date")\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1 = df1.withColumn("Count", last('Count', True).over(win))
Any help is much appreciated.
After @corgiman's answer (thank you very much for your time and help)...
If the DataFrame looks like this, @corgiman's solution does not work:
+-----+----------+---+---------+
|Count| Date| GR|Count_NEW|
+-----+----------+---+---------+
| 25|2012-01-02|100| 25|
| 250|2012-01-02|110| 250|
| 26|2012-01-03|100| 26|
| 251|2012-01-03|110| 251|
| 24|2012-01-04|100| 24|
| 242|2012-01-04|110| 242|
| 26|2012-01-05|100| 26|
| 254|2012-01-05|110| 254|
| 0|2012-01-06|100| 0|
| 254|2012-01-06|110| 254|
| 0|2012-01-07|100| 0|
| 256|2012-01-07|110| 256|
| 28|2012-01-08|100| 28|
| 0|2012-01-08|110| 0|
| 22|2012-01-09|100| 22|
| 289|2012-01-09|110| 289|
| 29|2012-01-10|100| 29|
| 276|2012-01-10|110| 276|
| 21|2012-01-11|100| 21|
| 259|2012-01-11|110| 259|
+-----+----------+---+---------+
Here GR = 100 is 0 on both 2012-01-06 and 2012-01-07, and I want both filled with the previous non-zero value, i.e. 26 from 2012-01-05.
So the desired result is this...
+-----+----------+---+---------+
|Count| Date| GR|Count_NEW|
+-----+----------+---+---------+
| 250|2012-01-02|110| 250|
| 251|2012-01-03|110| 251|
| 242|2012-01-04|110| 242|
| 254|2012-01-05|110| 254|
| 254|2012-01-06|110| 254|
| 256|2012-01-07|110| 256|
| 0|2012-01-08|110| 256|
| 289|2012-01-09|110| 289|
| 276|2012-01-10|110| 276|
| 259|2012-01-11|110| 259|
| 280|2012-01-12|110| 280|
| 290|2012-01-13|110| 290|
| 25|2012-01-02|100| 25|
| 26|2012-01-03|100| 26|
| 24|2012-01-04|100| 24|
| 26|2012-01-05|100| 26|
| 0|2012-01-06|100| 26|
**| 0|2012-01-07|100| 26|**
| 28|2012-01-08|100| 28|
| 22|2012-01-09|100| 22|
+-----+----------+---+---------+
But what I get is this...
+-----+----------+---+---------+
|Count| Date| GR|Count_NEW|
+-----+----------+---+---------+
| 250|2012-01-02|110| 250|
| 251|2012-01-03|110| 251|
| 242|2012-01-04|110| 242|
| 254|2012-01-05|110| 254|
| 254|2012-01-06|110| 254|
| 256|2012-01-07|110| 256|
| 0|2012-01-08|110| 256|
| 289|2012-01-09|110| 289|
| 276|2012-01-10|110| 276|
| 259|2012-01-11|110| 259|
| 280|2012-01-12|110| 280|
| 290|2012-01-13|110| 290|
| 25|2012-01-02|100| 25|
| 26|2012-01-03|100| 26|
| 24|2012-01-04|100| 24|
| 26|2012-01-05|100| 26|
| 0|2012-01-06|100| 26|
*| 0|2012-01-07|100| 0|*
| 28|2012-01-08|100| 28|
| 22|2012-01-09|100| 22|
+-----+----------+---+---------+
Answer 0: (score: 1)
You can change the 0 values to null and use the ignorenulls argument of the last function.
Sample code:
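from pyspark.sql import Window, functions as F
# Replace zeros with null so that last(..., ignorenulls=True) can skip them.
pdf = pdf.withColumn('Count', F.when(pdf['Count'] == 0, F.lit(None)).otherwise(pdf['Count']))
win = Window.partitionBy("GR").orderBy("Date")
# With orderBy and no explicit frame, the window runs from the partition start to the current row.
s = F.last('Count', ignorenulls=True).over(win)
pdf = pdf.withColumn("Count", s)
pdf.show()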
The output will be:
+---+----------+-----+
| GR| Date|Count|
+---+----------+-----+
|110|2012-01-02| 250|
|110|2012-01-03| 251|
|110|2012-01-04| 242|
|110|2012-01-05| 254|
|110|2012-01-06| 254|
|110|2012-01-07| 256|
|110|2012-01-08| 256|
|110|2012-01-09| 289|
|110|2012-01-10| 276|
|110|2012-01-11| 259|
|110|2012-01-12| 280|
|110|2012-01-13| 290|
|100|2012-01-02| 25|
|100|2012-01-03| 26|
|100|2012-01-04| 24|
|100|2012-01-05| 26|
|100|2012-01-06| 26|
|100|2012-01-07| 25|
|100|2012-01-08| 28|
|100|2012-01-09| 22|
+---+----------+-----+
Answer 1: (score: 0)
Using when and otherwise will get you what you want.
You only need to change your code from this:
win = Window.partitionBy("GR").orderBy("Date")\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1 = df1.withColumn("Count", last('Count', True).over(win))
To:
win = Window.partitionBy("GR").orderBy("Date")\
.rowsBetween(Window.unboundedPreceding, -1)
df1 = df1.withColumn("Count_new", F.when(df1.Count==0, F.last('Count', True).over(win)).otherwise(df1.Count))
The output will be:
+-----+----------+---+---------+
|Count| Date| GR|Count_new|
+-----+----------+---+---------+
| 250|2012-01-02|110| 250|
| 251|2012-01-03|110| 251|
| 242|2012-01-04|110| 242|
| 254|2012-01-05|110| 254|
| 254|2012-01-06|110| 254|
| 256|2012-01-07|110| 256|
| 0|2012-01-08|110| 256|
| 289|2012-01-09|110| 289|
| 276|2012-01-10|110| 276|
| 259|2012-01-11|110| 259|
| 280|2012-01-12|110| 280|
| 290|2012-01-13|110| 290|
| 25|2012-01-02|100| 25|
| 26|2012-01-03|100| 26|
| 24|2012-01-04|100| 24|
| 26|2012-01-05|100| 26|
| 0|2012-01-06|100| 26|
| 25|2012-01-07|100| 25|
| 28|2012-01-08|100| 28|
| 22|2012-01-09|100| 22|
+-----+----------+---+---------+
Answer 2: (score: 0)
Thank you both. In the meantime I tried the following, and it works well (it is similar to Solat's idea: convert 0 to null). Here is what I did:
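from pyspark.sql import Window
from pyspark.sql.functions import col, isnan, last, when
# Keep only valid, positive counts; everything else (0, NaN, null) becomes null.
df1 = pdf.withColumn("Count_NEW", \
when(~isnan("Count") & col("Count").isNotNull()\
& (col("Count") > 0), col("Count"))\
.otherwise(None) )
win = Window.partitionBy("GR").orderBy("Date")\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Forward fill: take the last non-null value seen so far in each GR partition.
df1 = df1.withColumn("Count_NEW", last('Count_NEW', True).over(win))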