PySpark: Replacing nulls bounded by the same value

Date: 2018-02-17 06:24:47

标签: pyspark apache-spark-sql spark-dataframe pyspark-sql

Q: Is there a way to replace null values in a specified column with values from other rows of the same column?

I want to replace the null values that lie between two rows holding the same value in the specified column.

               Original DF                                         Desired DF
+-----+----+-----+-----+----+----------+            +-----+----+-----+-----+----+----------+
|namea|Exam|name1|math1|phy1|Prev_Rank1|            |namea|Exam|name1|math1|phy1|Prev_Rank1|
+-----+----+-----+-----+----+----------+            +-----+----+-----+-----+----+----------+
|Ahmad|  48| null| null|null|      null|            |Ahmad|  48| null| null|null|      null|
|Ahmad|  49| null| null|null|      null|            |Ahmad|  49| null| null|null|      null|
|Ahmad|  50|Ahmad|   50|  54|         3|            |Ahmad|  50|Ahmad|   50|  54|         3|
|Ahmad|  51| null| null|null|      null|            |Ahmad|  51|Ahmad|   50|  54|         3|
|Ahmad|  53| null| null|null|      null|            |Ahmad|  53|Ahmad|   50|  54|         3|
|Ahmad|  54|Ahmad|   50|  54|         3|  >>>>>>>>  |Ahmad|  54|Ahmad|   50|  54|         3|
|Ahmad|  88| null| null|null|      null|  >>>>>>>>  |Ahmad|  88| null| null|null|      null|
|Ahmad|  90|Ahmad|  100|  90|         2|            |Ahmad|  90|Ahmad|  100|  90|         2|
|Ahmad|  95| null| null|null|      null|            |Ahmad|  95|Ahmad|  100|  90|         2|
|Ahmad| 100|Ahmad|  100|  90|         2|            |Ahmad| 100|Ahmad|  100|  90|         2|
|Ahmad| 101| null| null|null|      null|            |Ahmad| 101| null| null|null|      null|
| Emma|  52| Emma|   52|  85|         1|            | Emma|  52| Emma|   52|  85|         1|
| Emma|  85| Emma|   52|  85|         1|            | Emma|  85| Emma|   52|  85|         1|
+-----+----+-----+-----+----+----------+            +-----+----+-----+-----+----+----------+

I tried to replace the null values with the following steps:

import sys
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

DF7=Orignal_DF.withColumn("name1", fn.last('name1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))
DF7=DF7.withColumn("math1", fn.last('math1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))
DF7=DF7.withColumn("phy1", fn.last('phy1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))
DF7=DF7.withColumn("Prev_Rank1", fn.last('Prev_Rank1', True).over(Window.partitionBy('namea').orderBy('Exam').rowsBetween(-sys.maxsize, 0)))

The resulting DF is:

+-----+----+-----+-----+----+----------+
|namea|Exam|name1|math1|phy1|Prev_Rank1|
+-----+----+-----+-----+----+----------+
|Ahmad|  48| null| null|null|      null|
|Ahmad|  49| null| null|null|      null|
|Ahmad|  50|Ahmad|   50|  54|         3|
|Ahmad|  51|Ahmad|   50|  54|         3|
|Ahmad|  53|Ahmad|   50|  54|         3|
|Ahmad|  54|Ahmad|   50|  54|         3|
|Ahmad|  88|Ahmad|   50|  54|         3|
|Ahmad|  90|Ahmad|  100|  90|         2|
|Ahmad|  95|Ahmad|  100|  90|         2|
|Ahmad| 100|Ahmad|  100|  90|         2|
|Ahmad| 101|Ahmad|  100|  90|         2|
| Emma|  52| Emma|   52|  85|         1|
| Emma|  85| Emma|   52|  85|         1|
+-----+----+-----+-----+----+----------+
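
A note on why the attempt over-fills: fn.last with ignorenulls=True over rowsBetween(-sys.maxsize, 0) carries the most recent non-null value forward indefinitely, so every null after the first non-null row gets filled, not only the nulls enclosed between two equal values. Below is a possible pure-DataFrame sketch of the bounded fill. It is not from the question or the answer below; DF8 is just an illustrative name, fn and Window are as imported above, and each column is treated independently (so, for example, name1 at Exam 88 would still be filled, since 'Ahmad' bounds it on both sides). The idea is to forward-fill with last(), backward-fill with first(), and keep the fill only where the two sides agree.

w_prev = Window.partitionBy('namea').orderBy('Exam').rowsBetween(Window.unboundedPreceding, 0)
w_next = Window.partitionBy('namea').orderBy('Exam').rowsBetween(0, Window.unboundedFollowing)

DF8 = Orignal_DF
for c in ['name1', 'math1', 'phy1', 'Prev_Rank1']:
    prev_val = fn.last(c, True).over(w_prev)    # last non-null value at or before this row
    next_val = fn.first(c, True).over(w_next)   # first non-null value at or after this row
    # Fill a null only when the non-null values on both sides are identical;
    # if either side has no value, the comparison is null and the null is kept.
    DF8 = DF8.withColumn(c, fn.when(fn.col(c).isNull() & (prev_val == next_val), prev_val)
                            .otherwise(fn.col(c)))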

1 Answer:

Answer 0 (score: 2)

Here is a possible solution:

The approach taken is to convert each column of the dataframe into a list. For each list, the null values bounded by repeated values are then smoothed. The lists are then recombined into a result list, which is converted back into a dataframe.

Set up the test dataset as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('testSmothing').getOrCreate()
df=spark.createDataFrame(data=[(None,'A',3)\
                               ,(None,'B',None)\
                               ,(3,None,None)\
                               ,(None,'C',-4)\
                               ,(None,'D',-2)\
                               ,(3,None,None)\
                               ,(4,None,None)\
                               ,(5,'G',-2)\
                               ,(6,'H',-1)\
                               ,(None,'I',-1)\
                               ,(None,None,-1)\
                               ,(8,'I',-1)\
                               ,(9,'J',-1)]\
                               ,schema=['x1','x2','x3'])

df.show()

+----+----+----+
|  x1|  x2|  x3|
+----+----+----+
|null|   A|   3|
|null|   B|null|
|   3|null|null|
|null|   C|  -4|
|null|   D|  -2|
|   3|null|null|
|   4|null|null|
|   5|   G|  -2|
|   6|   H|  -1|
|null|   I|  -1|
|null|null|  -1|
|   8|   I|  -1|
|   9|   J|  -1|
+----+----+----+

Helper function 1:

Within the current sublist, check whether the null values are bounded by the same value:

def isRepeatBound(tempList):
    # Starting from the non-null value tempList[0], return True if any null
    # in the list is immediately followed by that same value, i.e. the nulls
    # are bounded by a repeat of the starting value.
    count = 0
    startElt = tempList[0]
    for elt in tempList:
        if count < len(tempList) - 1:
            count = count + 1
            if elt is None and tempList[count] == startElt:
                return True
    return False
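
For example, with values taken from the x1 column above, a run of nulls closed by the same value returns True, while one closed by a different value returns False:

isRepeatBound([3, None, None, 3, 4])   # True  -> the nulls are bounded by a repeated 3
isRepeatBound([6, None, None, 8, 9])   # False -> 6 and 8 differ, so the nulls are not bounded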

Helper function 2:

Smooth the current list:

def smoothLst(lst):
    valueFound = False
    colLst = []
    index = 0
    smooth = False
    for elt in lst:
        if elt is None and valueFound == False:
            # No usable bounding value seen yet: keep the null as-is.
            colLst.append(elt)
        elif elt is None and valueFound == True:
            # Null following a known value: fill it only if that value
            # was found to be repeated after the run of nulls.
            if smooth == True:
                colLst.append(lastValue)
            else:
                colLst.append(elt)
        elif index < len(lst) - 1 and isRepeatBound(lst[index:]) == True:
            # Non-null value that reappears right after a run of nulls:
            # remember it so the following nulls can be filled.
            smooth = True
            lastValue = elt
            valueFound = True
            colLst.append(elt)
        else:
            # Non-null value with no qualifying repeat ahead: reset the state.
            smooth = False
            valueFound = False
            colLst.append(elt)
        index = index + 1

    return colLst
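
As a quick check, again using a slice of the x1 values, only the nulls enclosed by the repeated 3 get filled:

smoothLst([None, None, 3, None, None, 3, 4])
# -> [None, None, 3, 3, 3, 3, 4]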

Main()

Iterate over all columns. Convert each column into a list and smooth it with the helper functions above. Store the smoothed lists in a master list, which is then transposed and converted back into the final result DF.

colNames = df.schema.names
resultlst = []
for name in colNames:
    # Pull the column down to the driver as a plain Python list and smooth it.
    lst = df.select(df[name]).rdd.flatMap(lambda x: x).collect()
    smoothList = smoothLst(lst)
    resultlst.append(smoothList)

# Transpose the list of columns back into a list of rows and rebuild the DataFrame.
transposeResultLst = list(map(list, zip(*resultlst)))
resultDF = spark.sparkContext.parallelize(transposeResultLst).toDF(['x1', 'x2', 'x3'])

resultDF.show()

+----+----+----+
|  x1|  x2|  x3|
+----+----+----+
|null|   A|   3|
|null|   B|null|
|   3|null|null|
|   3|   C|  -4|
|   3|   D|  -2|
|   3|null|  -2|
|   4|null|  -2|
|   5|   G|  -2|
|   6|   H|  -1|
|   6|   I|  -1|
|   6|   I|  -1|
|   8|   I|  -1|
|   9|   J|  -1|
+----+----+----+
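
One caveat on this approach: every column is pulled to the driver with collect(), so it is fine for a small frame like the test data but will not scale to a large DataFrame. For larger data, a window-function approach along the lines sketched under the question keeps the smoothing distributed.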