Q: Is there a way to replace null values in a specified column with values from other rows of the same column?
I want to replace the nulls that lie between two rows that have the same value in the specified column.
Original DF                                       Desired DF
+-----+----+-----+-----+----+----------+          +-----+----+-----+-----+----+----------+
|namea|Exam|name1|math1|phy1|Prev_Rank1|          |namea|Exam|name1|math1|phy1|Prev_Rank1|
+-----+----+-----+-----+----+----------+          +-----+----+-----+-----+----+----------+
|Ahmad|  48| null| null|null|      null|          |Ahmad|  48| null| null|null|      null|
|Ahmad|  49| null| null|null|      null|          |Ahmad|  49| null| null|null|      null|
|Ahmad|  50|Ahmad|   50|  54|         3|          |Ahmad|  50|Ahmad|   50|  54|         3|
|Ahmad|  51| null| null|null|      null|          |Ahmad|  51|Ahmad|   50|  54|         3|
|Ahmad|  53| null| null|null|      null|          |Ahmad|  53|Ahmad|   50|  54|         3|
|Ahmad|  54|Ahmad|   50|  54|         3| >>>>>>>> |Ahmad|  54|Ahmad|   50|  54|         3|
|Ahmad|  88| null| null|null|      null| >>>>>>>> |Ahmad|  88| null| null|null|      null|
|Ahmad|  90|Ahmad|  100|  90|         2|          |Ahmad|  90|Ahmad|  100|  90|         2|
|Ahmad|  95| null| null|null|      null|          |Ahmad|  95|Ahmad|  100|  90|         2|
|Ahmad| 100|Ahmad|  100|  90|         2|          |Ahmad| 100|Ahmad|  100|  90|         2|
|Ahmad| 101| null| null|null|      null|          |Ahmad| 101| null| null|null|      null|
| Emma|  52| Emma|   52|  85|         1|          | Emma|  52| Emma|   52|  85|         1|
| Emma|  85| Emma|   52|  85|         1|          | Emma|  85| Emma|   52|  85|         1|
+-----+----+-----+-----+----+----------+          +-----+----+-----+-----+----+----------+
I tried to replace the null values with the following steps:
import sys
from pyspark.sql import Window, functions as fn

# forward-fill each column with the last non-null value seen so far in the partition
w = Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)
DF7 = Orignal_DF.withColumn("name1", fn.last('name1', True).over(w))
DF7 = DF7.withColumn("math1", fn.last('math1', True).over(w))
DF7 = DF7.withColumn("phy1", fn.last('phy1', True).over(w))
DF7 = DF7.withColumn("Prev_Rank1", fn.last('Prev_Rank1', True).over(w))
The resulting DF is:
+-----+----+-----+-----+----+----------+
|namea|Exam|name1|math1|phy1|Prev_Rank1|
+-----+----+-----+-----+----+----------+
|Ahmad|  48| null| null|null|      null|
|Ahmad|  49| null| null|null|      null|
|Ahmad|  50|Ahmad|   50|  54|         3|
|Ahmad|  51|Ahmad|   50|  54|         3|
|Ahmad|  53|Ahmad|   50|  54|         3|
|Ahmad|  54|Ahmad|   50|  54|         3|
|Ahmad|  88|Ahmad|   50|  54|         3|
|Ahmad|  90|Ahmad|  100|  90|         2|
|Ahmad|  95|Ahmad|  100|  90|         2|
|Ahmad| 100|Ahmad|  100|  90|         2|
|Ahmad| 101|Ahmad|  100|  90|         2|
| Emma|  52| Emma|   52|  85|         1|
| Emma|  85| Emma|   52|  85|         1|
+-----+----+-----+-----+----+----------+
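The forward fill above carries the last value too far: the rows at Exam 88 and 101 get filled even though the desired output keeps them null. For reference (a sketch added here, not part of the original question or answer), the desired behaviour can also be expressed with window functions alone by combining a forward fill and a backward fill and keeping the carried value only where the two agree, i.e. where the run of nulls is bounded by the same row on both sides. The sketch assumes name1 is null exactly when the other three columns are, as in the sample data:
from pyspark.sql import Window
from pyspark.sql import functions as fn

fill_cols = ['name1', 'math1', 'phy1', 'Prev_Rank1']

w_prev = Window.partitionBy('namea').orderBy('Exam').rowsBetween(Window.unboundedPreceding, 0)
w_next = Window.partitionBy('namea').orderBy('Exam').rowsBetween(0, Window.unboundedFollowing)

# Bundle the fillable columns into one struct, but only for rows that carry values,
# so that last()/first() with ignorenulls skips the all-null rows.
row_struct = fn.when(fn.col('name1').isNotNull(), fn.struct(*fill_cols))

filled = (Orignal_DF
          .withColumn('prev_row', fn.last(row_struct, ignorenulls=True).over(w_prev))
          .withColumn('next_row', fn.first(row_struct, ignorenulls=True).over(w_next)))

# Fill a null only when the closest non-null rows before and after it are identical.
for c in fill_cols:
    filled = filled.withColumn(
        c,
        fn.coalesce(fn.col(c),
                    fn.when(fn.col('prev_row') == fn.col('next_row'),
                            fn.col('prev_row')[c])))

result = filled.drop('prev_row', 'next_row')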
Answer 0 (score: 2)
Here is a possible solution:
The approach is to turn each column of the dataframe into a list. For each list, smooth the null values that are bounded by a repeated value, then recombine the lists into a result list and convert it back into a dataframe.
Set up the test dataset as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('testSmothing').getOrCreate()
df = spark.createDataFrame(
    data=[(None, 'A', 3),
          (None, 'B', None),
          (3, None, None),
          (None, 'C', -4),
          (None, 'D', -2),
          (3, None, None),
          (4, None, None),
          (5, 'G', -2),
          (6, 'H', -1),
          (None, 'I', -1),
          (None, None, -1),
          (8, 'I', -1),
          (9, 'J', -1)],
    schema=['x1', 'x2', 'x3'])
df.show()
+----+----+----+
|  x1|  x2|  x3|
+----+----+----+
|null|   A|   3|
|null|   B|null|
|   3|null|null|
|null|   C|  -4|
|null|   D|  -2|
|   3|null|null|
|   4|null|null|
|   5|   G|  -2|
|   6|   H|  -1|
|null|   I|  -1|
|null|null|  -1|
|   8|   I|  -1|
|   9|   J|  -1|
+----+----+----+
Helper function 1:
Within the current sub-list, check whether the null values are bounded by the same value:
def isRepeatBound(tempList):
    # tempList starts with a non-null value; return True if a null later in the
    # list is immediately followed by that same starting value.
    count = 0
    startElt = tempList[0]
    for elt in tempList:
        if count < len(tempList) - 1:
            count = count + 1
            if elt is None and tempList[count] == startElt:
                return True
    return False
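For example (an illustration added here, not in the original answer), on two small lists:
# the nulls after the leading -2 are eventually followed by -2 again,
# so they count as bounded by a repeated value
print(isRepeatBound([-2, None, None, -2, -1]))   # True
# here the value after the null run differs from the starting value
print(isRepeatBound([3, None, None, -4]))        # False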
Helper function 2:
Smooth the current list:
def smoothLst(lst):
    valueFound = False   # have we seen a value that may be carried forward?
    colLst = []          # smoothed output list
    index = 0
    smooth = False       # True while the current run of nulls is bounded by a repeat
    for elt in lst:
        if (index == 0):
            if (elt is None and valueFound == False):
                colLst.append(elt)
            elif (elt is None and valueFound == True):
                if (smooth == True):
                    colLst.append(lastValue)
                else:
                    colLst.append(elt)
            elif (index < len(lst) - 1 and isRepeatBound(lst[index:]) == True):
                # a non-null value that reappears right after the following nulls:
                # remember it so those nulls can be filled
                smooth = True
                lastValue = elt
                valueFound = True
                colLst.append(elt)
            else:
                smooth = False
                valueFound = False
                colLst.append(elt)
        else:
            # same handling as the index == 0 case
            if (elt is None and valueFound == False):
                colLst.append(elt)
            elif (elt is None and valueFound == True):
                if (smooth == True):
                    colLst.append(lastValue)
                else:
                    colLst.append(elt)
            elif (index < len(lst) - 1 and isRepeatBound(lst[index:]) == True):
                smooth = True
                lastValue = elt
                valueFound = True
                colLst.append(elt)
            else:
                smooth = False
                valueFound = False
                colLst.append(elt)
        index = index + 1
    return colLst
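As an illustration (not in the original answer), applying it to the x3 column of the test data fills only the nulls that sit between the two -2 values:
x3 = [3, None, None, -4, -2, None, None, -2, -1, -1, -1, -1, -1]
print(smoothLst(x3))
# [3, None, None, -4, -2, -2, -2, -2, -1, -1, -1, -1, -1]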
Main()
Iterate over all the columns. Convert each column into a list and smooth it with the helper functions above. Store each smoothed list in a master list, which is then transposed and converted back into the final result DF.
colNames = df.schema.names
resultlst = []
for name in colNames:
    # pull the column down to the driver as a plain Python list
    lst = df.select(df[name]).rdd.flatMap(lambda x: x).collect()
    smoothList = smoothLst(lst)
    resultlst.append(smoothList)
# transpose the list of smoothed columns back into a list of rows
transposeResultLst = list(map(list, zip(*resultlst)))
resultDF = spark.sparkContext.parallelize(transposeResultLst).toDF(['x1', 'x2', 'x3'])
resultDF.show()
+----+----+----+
|  x1|  x2|  x3|
+----+----+----+
|null|   A|   3|
|null|   B|null|
|   3|null|null|
|   3|   C|  -4|
|   3|   D|  -2|
|   3|null|  -2|
|   4|null|  -2|
|   5|   G|  -2|
|   6|   H|  -1|
|   6|   I|  -1|
|   6|   I|  -1|
|   8|   I|  -1|
|   9|   J|  -1|
+----+----+----+