How do I create a DataFrame with the rows that have a given value plus the row that follows each of them?

Time: 2019-04-14 11:21:41

Tags: dataframe pyspark

  

Suppose we have the following DataFrame:

# a b       c       d
# 1 10:10   red     open
# 2 11:12   blau    closed
# 3 11:30   black   closed
# 4 02:13   red     open
# 5 03:00   yellow  closed
# 6 03:18   white   closed
# 7 04:15   red     open
# 8 06:00   black   closed
  

I want to create a new DataFrame that keeps every row where column c is red, together with the row immediately after it. Like this:

# a b       c       d
# 1 10:10   red     open
# 2 11:12   blau    closed
# 4 02:13   red     open
# 5 03:00   yellow  closed
# 7 04:15   red     open
# 8 06:00   black   closed
  

I would appreciate any help. Thanks and regards.

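1 Answer:

Answer 0: (score: 0)

With lag we can access the previous row's data, so each row can check whether the row before it was red. Here is a solution: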

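from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag

# Reuses the active session if one exists (e.g. in the pyspark shell).
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, '10:10', 'red', 'open'),
                            (2, '11:12', 'blau', 'closed'),
                            (3, '11:30', 'black', 'closed'),
                            (4, '02:13', 'red', 'open'),
                            (5, '03:00', 'yellow', 'closed'),
                            (6, '03:18', 'white', 'closed'),
                            (7, '04:15', 'red', 'open'),
                            (8, '06:00', 'black', 'closed')],
                           ["a", "b", "c", "d"])

# A window without partitionBy moves all rows to a single partition; fine for small data.
window = Window.orderBy("a")
# prev_c holds column c of the previous row. It is null for the first row, so the
# first row is kept only if it is red itself (a default of "red" would always keep it).
df = df.withColumn("prev_c", lag("c", 1).over(window))
# Keep rows that are red, or whose previous row was red.
df = df.filter((col("c") == "red") | (col("prev_c") == "red")).drop("prev_c")
df.show()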

Result

+---+-----+------+------+
|  a|    b|     c|     d|
+---+-----+------+------+
|  1|10:10|   red|  open|
|  2|11:12|  blau|closed|
|  4|02:13|   red|  open|
|  5|03:00|yellow|closed|
|  7|04:15|   red|  open|
|  8|06:00| black|closed|
+---+-----+------+------+
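
For intuition, the lag-and-filter idea can be mimicked in plain Python without a Spark session. This is only a sketch under assumptions: the helper name keep_red_and_next is hypothetical and not part of PySpark, and the rows are assumed to be small enough to sort in memory by column a.

```python
# Plain-Python sketch of the lag-and-filter logic (no Spark required).
# keep_red_and_next is a hypothetical helper, used here only for illustration.
rows = [
    (1, "10:10", "red", "open"),
    (2, "11:12", "blau", "closed"),
    (3, "11:30", "black", "closed"),
    (4, "02:13", "red", "open"),
    (5, "03:00", "yellow", "closed"),
    (6, "03:18", "white", "closed"),
    (7, "04:15", "red", "open"),
    (8, "06:00", "black", "closed"),
]


def keep_red_and_next(rows, color="red"):
    """Keep rows whose c equals `color`, plus the row right after each of them."""
    kept = []
    prev_c = None  # no previous value for the first row, like lag with no default
    for row in sorted(rows, key=lambda r: r[0]):  # order by column a
        c = row[2]
        if c == color or prev_c == color:
            kept.append(row)
        prev_c = c  # remember this row's c for the next iteration
    return kept


selected = keep_red_and_next(rows)
print([r[0] for r in selected])  # prints [1, 2, 4, 5, 7, 8]
```

Because prev_c starts as None, the first row is kept only when it is red itself, which matters if the data does not happen to begin with a red row.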