Python - find the longest stretch of null values and replace it with 0

Time: 2019-04-09 06:51:35

Tags: python pandas

I have a dataframe with a datetime column and one value column. Within each particular date I have to find the longest stretch of null values and replace it with zero. In the example below, the longest stretch of nulls on January 1st is 3 consecutive values, so those are the ones I have to replace with zero. The same process then has to be repeated for January 2nd.

Note: only the longest stretch of null values should be replaced with zero; the other null values must be left as they are.

Below is my sample data:

Datetime            X
01-01-2018 00:00    1
01-01-2018 00:05    NaN
01-01-2018 00:10    2
01-01-2018 00:15    3
01-01-2018 00:20    2
01-01-2018 00:25    NaN
01-01-2018 00:30    NaN
01-01-2018 00:35    NaN
01-01-2018 00:40    4
02-01-2018 00:00    NaN
02-01-2018 00:05    2
02-01-2018 00:10    2
02-01-2018 00:15    2
02-01-2018 00:20    2
02-01-2018 00:25    NaN
02-01-2018 00:30    NaN
02-01-2018 00:35    3
02-01-2018 00:40    NaN
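
For reference, a quick reconstruction of this frame (an illustrative sketch; the dates are day-first, so 02-01-2018 is January 2nd):

import numpy as np
import pandas as pd

# Rebuild the sample above; np.nan stands in for the missing entries.
idx = pd.date_range('2018-01-01', periods=9, freq='5min').append(
      pd.date_range('2018-01-02', periods=9, freq='5min'))
df = pd.DataFrame({'Datetime': idx.strftime('%d-%m-%Y %H:%M'),
                   'X': [1, np.nan, 2, 3, 2, np.nan, np.nan, np.nan, 4,
                         np.nan, 2, 2, 2, 2, np.nan, np.nan, 3, np.nan]})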

2 answers:

Answer 0 (score: 0)

Use:

#convert columns to floats and datetimes
df['X'] = df['X'].astype(float)
df['Datetime'] = pd.to_datetime(df['Datetime'], dayfirst=True)

#check missing values
s = df['X'].isna()
#create consecutive groups 
g = s.ne(s.shift()).cumsum()
#get dates from datetimes
dates = df['Datetime'].dt.date

#get counts of consecutive NaNs
sizes = s.groupby([g[s], dates[s]]).transform('count')

#compare max count per dates to mask
mask = sizes.groupby(dates).transform('max').eq(sizes)

#set 0 by mask
df.loc[mask, 'X'] = 0

print (df)
              Datetime    X
0  2018-01-01 00:00:00  1.0
1  2018-01-01 00:05:00  NaN
2  2018-01-01 00:10:00  2.0
3  2018-01-01 00:15:00  3.0
4  2018-01-01 00:20:00  2.0
5  2018-01-01 00:25:00  0.0
6  2018-01-01 00:30:00  0.0
7  2018-01-01 00:35:00  0.0
8  2018-01-01 00:40:00  4.0
9  2018-01-02 00:00:00  NaN
10 2018-01-02 00:05:00  2.0
11 2018-01-02 00:10:00  2.0
12 2018-01-02 00:15:00  2.0
13 2018-01-02 00:20:00  2.0
14 2018-01-02 00:25:00  0.0
15 2018-01-02 00:30:00  0.0
16 2018-01-02 00:35:00  3.0
17 2018-01-02 00:40:00  NaN
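
As a quick illustration of how s.ne(s.shift()).cumsum() labels consecutive runs, consider a toy Series (a minimal sketch, not from the answer above):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan, 3.0]).isna()
# The boolean flips wherever a new run starts, so the cumulative sum
# assigns one label per run of consecutive equal values.
g = s.ne(s.shift()).cumsum()
print(pd.concat([s.rename('isna'), g.rename('group')], axis=1))
#     isna  group
# 0  False      1
# 1   True      2
# 2  False      3
# 3   True      4
# 4   True      4   <- rows 3-4 form one NaN run (group 4, size 2)
# 5  False      5

Grouping the NaN flags by these labels (and the date) then gives the size of each NaN run, and the per-date maximum picks out the longest one.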

EDIT: You can create a list filtered of all the dates where the replacement should happen, and chain that mask m with the missing-value test s using & for bitwise AND:

sizes = s.groupby([g[s & m], dates[s & m]]).transform('count')

All together:

df['X'] = df['X'].astype(float)
df['Datetime'] = pd.to_datetime(df['Datetime'], dayfirst=True)

#check missing values
s = df['X'].isna()
#create consecutive groups 
g = s.ne(s.shift()).cumsum()
#get dates from datetimes
dates = df['Datetime'].dt.floor('d')

filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)

#get counts of consecutive NaNs
sizes = s.groupby([g[s & m], dates[s & m]]).transform('count')

#compare max count per dates to mask
mask = sizes.groupby(dates).transform('max').eq(sizes)

#set 0 by mask
df.loc[mask, 'X'] = 0

print (df)
              Datetime    X
0  2018-01-01 00:00:00  1.0
1  2018-01-01 00:05:00  NaN
2  2018-01-01 00:10:00  2.0
3  2018-01-01 00:15:00  3.0
4  2018-01-01 00:20:00  2.0
5  2018-01-01 00:25:00  0.0
6  2018-01-01 00:30:00  0.0
7  2018-01-01 00:35:00  0.0
8  2018-01-01 00:40:00  4.0
9  2018-01-02 00:00:00  NaN
10 2018-01-02 00:05:00  2.0
11 2018-01-02 00:10:00  2.0
12 2018-01-02 00:15:00  2.0
13 2018-01-02 00:20:00  2.0
14 2018-01-02 00:25:00  NaN
15 2018-01-02 00:30:00  NaN
16 2018-01-02 00:35:00  3.0
17 2018-01-02 00:40:00  NaN
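
One detail worth noting (and the reason the January 2nd NaNs survive in the output above): dates.isin(filtered) only matches the listed dates, and comparing against plain strings works because pandas coerces them to timestamps for a datetime64 column. A self-contained sketch of the same test with explicit timestamps:

import pandas as pd

# Same idea with explicit timestamps instead of raw strings.
dates = pd.Series(pd.to_datetime(['01-01-2018 00:05', '02-01-2018 00:25'],
                                 dayfirst=True)).dt.floor('d')
filtered = pd.to_datetime(['2018-01-01', '2019-01-01'])
print(dates.isin(filtered))  # [True, False]: only January 1st is selected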

Answer 1 (score: 0)

Interesting question.

My solution is written in Scala, but I am pretty sure there is a Python equivalent (a rough pandas translation appears after the result below). First, the setup: I used a case class KV; in your example the key would be the date and the value the X column.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{functions => F}
// In an application, toDF and .as[KV] also need the implicits of your
// SparkSession in scope: import spark.implicits._

case class KV(k: String, v: Double)
val ds = Seq(("a", 0.0),
    ("a", Double.NaN),
    ("a", Double.NaN),
    ("b", Double.NaN),
    ("b", Double.NaN)).toDF("k", "v").as[KV]
val win = Window.partitionBy("k")  // defined for partitioning by key; not used below

// Fold over each group's rows: a non-NaN value resets the counter and a
// NaN increments it, so this returns the length of the trailing NaN run.
def countConsecutiveNans(s: String, iter: Iterator[KV]): Int = {
  (0 /: iter)((cnt: Int, kv: KV) => if (kv.v.isNaN) cnt + 1 else 0)
}
ds.groupByKey(kv => kv.k).mapGroups(countConsecutiveNans)

The resulting dataset is:

+-----+
|value|
+-----+
|    2|
|    2|
+-----+
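
A rough pandas translation of the same fold (a sketch assuming the key/value pairs live in a small DataFrame, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'k': ['a', 'a', 'a', 'b', 'b'],
                   'v': [0.0, np.nan, np.nan, np.nan, np.nan]})

# Same fold as the Scala version: reset on a non-NaN value, increment
# on NaN, yielding the length of the trailing NaN run per key.
def count_consecutive_nans(values):
    cnt = 0
    for v in values:
        cnt = cnt + 1 if pd.isna(v) else 0
    return cnt

print(df.groupby('k')['v'].apply(count_consecutive_nans))
# k
# a    2
# b    2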

Hope it helps!