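Filling missing rows of data with the pandas reindex function

Date: 2015-10-06 18:23:52

Tags: python pandas missing-data

I am trying to fill the missing rows in my time series data using the pandas reindex function. My data looks like this: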

 100,2007,239,4,29.588,-30.851,-999.0,-999.0,-999.0,-999.00,13.125,-999.00
 100,2007,239,5,29.573,-30.843,-999.0,-999.0,-999.0,-999.00,13.126,-999.00
 100,2007,239,14,29.389,-30.880,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
 100,2007,239,15,29.367,-30.901,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
 100,2007,239,24,29.374,-30.920,-999.0,-999.0,-999.0,-999.00,13.135,-999.00
                                       .
                                       .

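This is one day of time series data at one-minute intervals, with the time given in the fourth column. Unlike a normal time series index, the time index of this data runs 0 to 59, 100 to 159, ..., 2300 to 2359, because a day has 24 hours and an hour has 60 minutes. So, to fill the gaps with 'nan' values, I wrote the following code: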

S = []
for i in range(0,24):

     s = np.arange(i*100,i*100+60)
     s = list(s)
S = S + s

pd.set_option('max_rows',10)
for INPUT in FileList:
     output = INPUT + "result" # set the output files
     data=pd.read_csv(INPUT,sep=',',index_col=[3],parse_dates=[3])
     index = 'S'#make the reference index to fill
     df = data
     sk_f = df.reindex(index)       
     sk_f.to_csv(output,na_rep='nan')

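With this code, my intention was to fill the missing rows with 'nan', based on the values in the fourth column and using S as the reference index. But the result is just rows of 'nan' instead of the gaps being filled, as shown below: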

,100,2007,241,22.471,-31.002,-999.0,-999.0.1,-999.0.2,-999.00,13.294,-999.00    .1
0,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
1,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan 
2,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
3,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
4,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
5,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
6,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
7,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan 
8,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
9,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
10,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
11,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan

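What I expect is for the gaps left by the missing rows in the original data to be filled. For example, the original data has no rows for indexes 0 to 3, so I want to add those rows in the original data format. I may be missing something. I would really appreciate any ideas or help.

Thank you, Isaac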

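1 Answer:

Answer 0: (score: 1)

First, I see a problem with the indentation of the line S = S + s, which builds the list. Because that line sits outside the loop, S keeps only the last s. You have to change: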

S = []
for i in range(0,24):

     s = np.arange(i*100,i*100+60)
     s = list(s)
S = S + s #keep only last s

to:

S = []
for i in range(0,24):
    s = np.arange(i*100,i*100+60)
    s = list(s)
    S = S + s

or, shorter:

S = []
for i in range(0,24):
    S = S + list(np.arange(i*100,i*100+60))
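The same reference index can also be built without an explicit loop. This is a minimal sketch that assumes only the HHMM layout described in the question (hours 0 to 23, minutes 0 to 59); the name `S` follows the code above.

```python
# Build the HHMM-style reference index: 0..59, 100..159, ..., 2300..2359.
S = [hour * 100 + minute for hour in range(24) for minute in range(60)]

print(len(S))            # number of minutes in one day
print(S[:3], S[58:62])   # start of hour 0 and the jump from 59 to 100
```

Using plain Python integers here avoids the NumPy dependency for this step; the resulting list can be passed to `reindex` in the same way.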

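Next, index = 'S' is also a problem. I think it is a typo and should be index = S, so that the list itself is passed rather than the string 'S'. You can also add the bfill() function to fill the gaps backward.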

sk_f = df.reindex(index).bfill()
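If the goal is only to insert the missing rows as 'nan', as the question describes, `reindex` on its own should already do that; `bfill()` is an extra step that copies the next available row backward. Here is a small sketch on a made-up frame (the column names `a` and `b`, the values, and the short `reference` index are illustrative, not from the original data):

```python
import pandas as pd

# Toy frame indexed by minute labels 4, 5 and 14 (made-up values).
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
                  index=[4, 5, 14])

reference = list(range(0, 16))  # hypothetical short reference index 0..15

with_nan = df.reindex(reference)             # missing labels become NaN rows
back_filled = df.reindex(reference).bfill()  # NaN rows take the next valid row

print(with_nan.head(6))
print(back_filled.head(6))
```

Note that `bfill()` leaves rows after the last valid label (15 in this toy example) as NaN, because there is no later row to copy from.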

Code:

import pandas as pd
import numpy as np
import io

S = []
for i in range(0,24):
    S = S + list(np.arange(i*100,i*100+60))

#original data
temp=u"""100,2007,239,4,29.588,-30.851,-999.0,-999.0,-999.0,-999.00,13.125,-999.00
100,2007,239,5,29.573,-30.843,-999.0,-999.0,-999.0,-999.00,13.126,-999.00
100,2007,239,14,29.389,-30.880,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,15,29.367,-30.901,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,24,29.374,-30.920,-999.0,-999.0,-999.0,-999.00,13.135,-999.00"""

#pd.set_option('max_rows',10)

# Column 3 holds HHMM-style integers rather than dates, so it is read as a plain integer index.
data = pd.read_csv(io.StringIO(temp), sep=',', header=None, index_col=[3])
data.index.name = None
print(data)

#     0     1    2       4       5    6    7    8    9       10   11
#4   100  2007  239  29.588 -30.851 -999 -999 -999 -999  13.125 -999
#5   100  2007  239  29.573 -30.843 -999 -999 -999 -999  13.126 -999
#14  100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#15  100  2007  239  29.367 -30.901 -999 -999 -999 -999  13.131 -999
#24  100  2007  239  29.374 -30.920 -999 -999 -999 -999  13.135 -999

index = S #make the reference index to fill
df = data
sk_f = df.reindex(index).bfill()

print(sk_f.head(20))
#     0     1    2       4       5    6    7    8    9       10   11
#0   100  2007  239  29.588 -30.851 -999 -999 -999 -999  13.125 -999
#1   100  2007  239  29.588 -30.851 -999 -999 -999 -999  13.125 -999
#2   100  2007  239  29.588 -30.851 -999 -999 -999 -999  13.125 -999
#3   100  2007  239  29.588 -30.851 -999 -999 -999 -999  13.125 -999
#4   100  2007  239  29.588 -30.851 -999 -999 -999 -999  13.125 -999
#5   100  2007  239  29.573 -30.843 -999 -999 -999 -999  13.126 -999
#6   100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#7   100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#8   100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#9   100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#10  100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#11  100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#12  100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#13  100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#14  100  2007  239  29.389 -30.880 -999 -999 -999 -999  13.131 -999
#15  100  2007  239  29.367 -30.901 -999 -999 -999 -999  13.131 -999
#16  100  2007  239  29.374 -30.920 -999 -999 -999 -999  13.135 -999
#17  100  2007  239  29.374 -30.920 -999 -999 -999 -999  13.135 -999
#18  100  2007  239  29.374 -30.920 -999 -999 -999 -999  13.135 -999
#19  100  2007  239  29.374 -30.920 -999 -999 -999 -999  13.135 -999
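
One detail in the code above that is easy to miss: the sample data has no header row, so `header=None` is passed to `read_csv`. Without it, pandas treats the first data row as column names, which appears to be what happened in the question's output. A short sketch with a two-row sample, assuming the same comma-separated layout (the variable names are illustrative):

```python
import io
import pandas as pd

sample = u"""100,2007,239,4,29.588
100,2007,239,5,29.573"""

# Default: the first line is consumed as the header, leaving one data row.
no_header_arg = pd.read_csv(io.StringIO(sample), index_col=3)

# header=None: both lines are data, and column 3 becomes an integer index.
with_header_none = pd.read_csv(io.StringIO(sample), header=None, index_col=3)

print(len(no_header_arg), len(with_header_none))
print(list(with_header_none.index))
```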