PySpark: Replacing nulls bounded by the same value

Date: 2018-02-17 06:24:47

标签: pyspark apache-spark-sql spark-dataframe pyspark-sql

Q: Is there a way to replace null values in a specified column with values from other rows of the same column?

I want to replace the null values that lie between two rows holding the same value in the specified column.

               Original DF                                         Desired DF
+-----+----+-----+-----+----+----------+            +-----+----+-----+-----+----+----------+
|namea|Exam|name1|math1|phy1|Prev_Rank1|            |namea|Exam|name1|math1|phy1|Prev_Rank1|
+-----+----+-----+-----+----+----------+            +-----+----+-----+-----+----+----------+
|Ahmad|  48| null| null|null|      null|            |Ahmad|  48| null| null|null|      null|
|Ahmad|  49| null| null|null|      null|            |Ahmad|  49| null| null|null|      null|
|Ahmad|  50|Ahmad|   50|  54|         3|            |Ahmad|  50|Ahmad|   50|  54|         3|
|Ahmad|  51| null| null|null|      null|            |Ahmad|  51|Ahmad|   50|  54|         3|
|Ahmad|  53| null| null|null|      null|            |Ahmad|  53|Ahmad|   50|  54|         3|
|Ahmad|  54|Ahmad|   50|  54|         3|  >>>>>>>>  |Ahmad|  54|Ahmad|   50|  54|         3|
|Ahmad|  88| null| null|null|      null|  >>>>>>>>  |Ahmad|  88| null| null|null|      null|
|Ahmad|  90|Ahmad|  100|  90|         2|            |Ahmad|  90|Ahmad|  100|  90|         2|
|Ahmad|  95| null| null|null|      null|            |Ahmad|  95|Ahmad|  100|  90|         2|
|Ahmad| 100|Ahmad|  100|  90|         2|            |Ahmad| 100|Ahmad|  100|  90|         2|
|Ahmad| 101| null| null|null|      null|            |Ahmad| 101| null| null|null|      null|
| Emma|  52| Emma|   52|  85|         1|            | Emma|  52| Emma|   52|  85|         1|
| Emma|  85| Emma|   52|  85|         1|            | Emma|  85| Emma|   52|  85|         1|
+-----+----+-----+-----+----+----------+            +-----+----+-----+-----+----+----------+

I tried to replace the null values with the following steps:

import sys
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

DF7=Orignal_DF.withColumn("name1", fn.last('name1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))
DF7=DF7.withColumn("math1", fn.last('math1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))
DF7=DF7.withColumn("phy1", fn.last('phy1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))
DF7=DF7.withColumn("Prev_Rank1", fn.last('Prev_Rank1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))

The resulting DF is:

+-----+----+-----+-----+----+----------+
|namea|Exam|name1|math1|phy1|Prev_Rank1|
+-----+----+-----+-----+----+----------+
|Ahmad|  48| null| null|null|      null|
|Ahmad|  49| null| null|null|      null|
|Ahmad|  50|Ahmad|   50|  54|         3|
|Ahmad|  51|Ahmad|   50|  54|         3|
|Ahmad|  53|Ahmad|   50|  54|         3|
|Ahmad|  54|Ahmad|   50|  54|         3|
|Ahmad|  88|Ahmad|   50|  54|         3|
|Ahmad|  90|Ahmad|  100|  90|         2|
|Ahmad|  95|Ahmad|  100|  90|         2|
|Ahmad| 100|Ahmad|  100|  90|         2|
|Ahmad| 101|Ahmad|  100|  90|         2|
| Emma|  52| Emma|   52|  85|         1|
| Emma|  85| Emma|   52|  85|         1|
+-----+----+-----+-----+----+----------+
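
A note on why the attempt over-fills: fn.last with ignorenulls=True over rowsBetween(-sys.maxsize, 0) carries the most recent non-null value forward indefinitely, so every null after the first non-null row gets filled, not only the nulls enclosed between two equal values. Below is a possible pure-DataFrame sketch of the bounded fill. It is not from the question or the answer below; DF8 is just an illustrative name, fn and Window are as imported above, and each column is treated independently (so, for example, name1 at Exam 88 would still be filled, since 'Ahmad' bounds it on both sides). The idea is to forward-fill with last(), backward-fill with first(), and keep the fill only where the two sides agree.

w_prev = Window.partitionBy('namea').orderBy('Exam').rowsBetween(Window.unboundedPreceding, 0)
w_next = Window.partitionBy('namea').orderBy('Exam').rowsBetween(0, Window.unboundedFollowing)

DF8 = Orignal_DF
for c in ['name1', 'math1', 'phy1', 'Prev_Rank1']:
    prev_val = fn.last(c, True).over(w_prev)    # last non-null value at or before this row
    next_val = fn.first(c, True).over(w_next)   # first non-null value at or after this row
    # Fill a null only when the non-null values on both sides are identical;
    # if either side has no value, the comparison is null and the null is kept.
    DF8 = DF8.withColumn(c, fn.when(fn.col(c).isNull() & (prev_val == next_val), prev_val)
                            .otherwise(fn.col(c)))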

1 Answer:

Answer 0 (score: 2)

Here is a possible solution:

The approach taken is to convert each column of the dataframe into a list. For each list, the null values bounded by repeated values are then smoothed. The lists are then recombined into a result list, which is converted back into a dataframe.

Set up the test dataset as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('testSmothing').getOrCreate()
df=spark.createDataFrame(data=[(None,'A',3)\
                               ,(None,'B',None)\
                               ,(3,None,None)\
                               ,(None,'C',-4)\
                               ,(None,'D',-2)\
                               ,(3,None,None)\
                               ,(4,None,None)\
                               ,(5,'G',-2)\
                               ,(6,'H',-1)\
                               ,(None,'I',-1)\
                               ,(None,None,-1)\
                               ,(8,'I',-1)\
                               ,(9,'J',-1)]\
                               ,schema=['x1','x2','x3'])

df.show()

+----+----+----+
|  x1|  x2|  x3|
+----+----+----+
|null|   A|   3|
|null|   B|null|
|   3|null|null|
|null|   C|  -4|
|null|   D|  -2|
|   3|null|null|
|   4|null|null|
|   5|   G|  -2|
|   6|   H|  -1|
|null|   I|  -1|
|null|null|  -1|
|   8|   I|  -1|
|   9|   J|  -1|
+----+----+----+

Helper function 1:

Within the current sublist, check whether the null values are bounded by the same value:

def isRepeatBound(tempList):
    # Starting from the non-null value tempList[0], return True if any null
    # in the list is immediately followed by that same value, i.e. the nulls
    # are bounded by a repeat of the starting value.
    count = 0
    startElt = tempList[0]
    for elt in tempList:
        if count < len(tempList) - 1:
            count = count + 1
            if elt is None and tempList[count] == startElt:
                return True
    return False
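
For example, with values taken from the x1 column above, a run of nulls closed by the same value returns True, while one closed by a different value returns False:

isRepeatBound([3, None, None, 3, 4])   # True  -> the nulls are bounded by a repeated 3
isRepeatBound([6, None, None, 8, 9])   # False -> 6 and 8 differ, so the nulls are not bounded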

Helper function 2:

Smooth the current list:

def smoothLst(lst):
    valueFound = False
    colLst = []
    index = 0
    smooth = False
    for elt in lst:
        if elt is None and valueFound == False:
            # No usable bounding value seen yet: keep the null as-is.
            colLst.append(elt)
        elif elt is None and valueFound == True:
            # Null following a known value: fill it only if that value
            # was found to be repeated after the run of nulls.
            if smooth == True:
                colLst.append(lastValue)
            else:
                colLst.append(elt)
        elif index < len(lst) - 1 and isRepeatBound(lst[index:]) == True:
            # Non-null value that reappears right after a run of nulls:
            # remember it so the following nulls can be filled.
            smooth = True
            lastValue = elt
            valueFound = True
            colLst.append(elt)
        else:
            # Non-null value with no qualifying repeat ahead: reset the state.
            smooth = False
            valueFound = False
            colLst.append(elt)
        index = index + 1

    return colLst
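
As a quick check, again using a slice of the x1 values, only the nulls enclosed by the repeated 3 get filled:

smoothLst([None, None, 3, None, None, 3, 4])
# -> [None, None, 3, 3, 3, 3, 4]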

Main()

Iterate over all columns. Convert each column into a list and smooth it with the helper functions above. Store the smoothed lists in a master list, which is then transposed and converted back into the final result DF.

colNames = df.schema.names
resultlst = []
for name in colNames:
    # Pull the column down to the driver as a plain Python list and smooth it.
    lst = df.select(df[name]).rdd.flatMap(lambda x: x).collect()
    smoothList = smoothLst(lst)
    resultlst.append(smoothList)

# Transpose the list of columns back into a list of rows and rebuild the DataFrame.
transposeResultLst = list(map(list, zip(*resultlst)))
resultDF = spark.sparkContext.parallelize(transposeResultLst).toDF(['x1', 'x2', 'x3'])

resultDF.show()

+----+----+----+
|  x1|  x2|  x3|
+----+----+----+
|null|   A|   3|
|null|   B|null|
|   3|null|null|
|   3|   C|  -4|
|   3|   D|  -2|
|   3|null|  -2|
|   4|null|  -2|
|   5|   G|  -2|
|   6|   H|  -1|
|   6|   I|  -1|
|   6|   I|  -1|
|   8|   I|  -1|
|   9|   J|  -1|
+----+----+----+
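
One caveat on this approach: every column is pulled to the driver with collect(), so it is fine for a small frame like the test data but will not scale to a large DataFrame. For larger data, a window-function approach along the lines sketched under the question keeps the smoothing distributed.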