Pandas: filling in blank dates

Time: 2020-07-25 16:45:17

Tags: python pandas datetime

Out of curiosity, how do you fill in missing dates with blank values (for plotting purposes)?

data = '''day   date    tempMLD salinityMLD densityMLD
    1   9/12/2014   177.859887  177.859887  177.859887
    2   9/13/2014   197.2614444 197.2614444 197.2614444
    3   9/14/2014   199.5787079 199.5787079 199.5787079
    5   9/16/2014   197.2535    197.2535    197.2535
    7   9/18/2014   195.9107222 195.9107222 195.9107222
    8   9/19/2014   200.7785    200.7785    200.7785
    10  9/21/2014   191.3220225 191.3220225 191.3220225
    12  9/23/2014   179.5676966 179.5676966 179.5676966
    13  9/24/2014   180.7201124 180.7201124 180.7201124
    15  9/26/2014   170.139382  170.139382  170.139382
    17  9/28/2014   171.7347753 171.7347753 171.7347753
    18  9/29/2014   180.4120787 180.4120787 180.4120787
    20  10/1/2014   221.9926404 221.9926404 221.9926404
    22  10/3/2014   177.458764  177.458764  177.458764
    23  10/4/2014   171.9423034 171.9423034 171.9423034
    25  10/6/2014   195.6371348 195.6371348 195.6371348
    27  10/8/2014   190.0867416 190.0867416 190.0867416
    28  10/9/2014   171.4321348 171.4321348 171.4321348
    30  10/11/2014  174.5272472 174.5272472 174.5272472
    32  10/13/2014  198.0153889 198.0153889 198.0153889'''

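Currently it plots like this, because it is programmed to attach the letter label to the first day of each month. Since the firsts are missing from the data, this is what happens. The original data is in a csv file, and I have even tried to build a df2 with the required date range: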

df = pd.read_csv('/content/drive/My Drive/Irminger_2020_Project_Colab_Notebooks/Apex_Array/Ready to Graph/profiler/MLD.csv',sep = ',',encoding='utf-8-sig',)
idx = pd.date_range('09-12-2014', '06-29-2019')
df['date'] = pd.to_datetime(df['date'])
s = df
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx,)
s


But this does not seem to work properly, because it fills everything with NaN.


1 answer:

Answer 0: (score: 0)

First, the setup work to create the data frame: import the data and set the data types (proper dates, integers, floats).

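import pandas as pd

# `data` holds the whitespace-delimited sample text from the question.
# The first line contains the column names.
columns = data.split('\n')[0].split()

# Split every remaining line into its fields (all strings at this point)
records = [record.split() for record in data.split('\n')[1:]]

df = (pd.DataFrame(data=records, columns=columns)
      .assign(date=lambda x: pd.to_datetime(x['date']))  # parse the date strings
      .set_index('date')                                 # dates become the index
      .sort_index()
     )

# Convert the remaining string columns to numeric types
int_fields = ['day']
df[int_fields] = df[int_fields].astype(int)

float_fields = ['tempMLD', 'salinityMLD', 'densityMLD']
df[float_fields] = df[float_fields].astype(float)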
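
As an alternative to the manual splitting above, here is a minimal sketch that lets `pd.read_csv` parse whitespace-delimited text and infer the numeric types. The names `sample` and `parsed` are hypothetical. `sample` is a shortened copy of the question's data, written without leading indentation.

```python
import io
import pandas as pd

# Hypothetical, shortened copy of the sample data (no leading indentation).
sample = '''day   date    tempMLD salinityMLD densityMLD
1   9/12/2014   177.859887  177.859887  177.859887
2   9/13/2014   197.2614444 197.2614444 197.2614444
3   9/14/2014   199.5787079 199.5787079 199.5787079
5   9/16/2014   197.2535    197.2535    197.2535'''

# sep=r'\s+' treats any run of whitespace as a single delimiter;
# integer and float columns are inferred automatically.
parsed = pd.read_csv(io.StringIO(sample), sep=r'\s+')

# The dates are month/day/year, so state the format explicitly.
parsed['date'] = pd.to_datetime(parsed['date'], format='%m/%d/%Y')
parsed = parsed.set_index('date').sort_index()
```

With this approach the separate `astype` calls are not needed, since `day` is read as an integer column and the three MLD columns as floats.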

With the dates as the index, reindex against the full daily range so that the missing dates appear as rows of NaN:

idx = pd.date_range(start=df.index.min(), end=df.index.max())
df = df.reindex(index=idx)
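
The attempt in the question most likely returned all NaN because the frame still had its default integer index when `pd.DatetimeIndex(s.index)` was called. The resulting timestamps would then not correspond to the `date` column, so no label in `idx` matches. Here is a minimal sketch of the same reindex for a CSV source, assuming the file has a `date` column. The inline `csv_text` is a hypothetical stand-in for `MLD.csv`, and the names `csv_df`, `full_range` and `csv_gaps` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: a few rows of the sample data.
csv_text = """day,date,tempMLD,salinityMLD,densityMLD
1,9/12/2014,177.859887,177.859887,177.859887
2,9/13/2014,197.2614444,197.2614444,197.2614444
3,9/14/2014,199.5787079,199.5787079,199.5787079
5,9/16/2014,197.2535,197.2535,197.2535
"""

csv_df = pd.read_csv(io.StringIO(csv_text))
csv_df['date'] = pd.to_datetime(csv_df['date'], format='%m/%d/%Y')

# Make the parsed dates the index *before* reindexing, so that the
# labels in the new date range can match the existing rows.
csv_df = csv_df.set_index('date').sort_index()

full_range = pd.date_range(csv_df.index.min(), csv_df.index.max(), freq='D')
csv_gaps = csv_df.reindex(full_range)  # 9/15 appears as a row of NaN
```

The existing rows keep their values and only the dates absent from the file are NaN.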

Finally, interpolate each column to replace the NaN values:

# Linear interpolation, applied to every (numeric) column
df = df.interpolate()
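
Since the question asks for blank values for plotting, interpolation is optional. Here is a minimal sketch of the two choices on a hypothetical two-point series (`temps`, `daily`, `with_gaps` and `filled` are illustrative names). `method='time'` is an alternative to the default linear method and weights by the actual time between observations.

```python
import pandas as pd

# Hypothetical short series with one missing day (9/15), taken from the sample values.
temps = pd.Series(
    [199.5787079, 197.2535],
    index=pd.to_datetime(['2014-09-14', '2014-09-16']),
    name='tempMLD',
)

daily = temps.reindex(
    pd.date_range(temps.index.min(), temps.index.max(), freq='D')
)

# Option 1: keep the NaN, so the missing day stays blank.
with_gaps = daily

# Option 2: estimate the missing day from its neighbours,
# weighting by the time elapsed between observations.
filled = daily.interpolate(method='time')
```

Matplotlib line plots break at NaN, so plotting `with_gaps` leaves 9/15 empty, while plotting `filled` draws a continuous line through the estimated value.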

Now we see 9/15, which was not in the original data:

print(df.head())

            day     tempMLD  salinityMLD  densityMLD
2014-09-12  1.0  177.859887   177.859887  177.859887
2014-09-13  2.0  197.261444   197.261444  197.261444
2014-09-14  3.0  199.578708   199.578708  199.578708
2014-09-15  4.0  198.416104   198.416104  198.416104
2014-09-16  5.0  197.253500   197.253500  197.253500
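
Note that `day` is shown as a float (`4.0`) in the output above: reindexing introduces NaN, which a plain integer column cannot hold. Here is a minimal sketch of two ways to get integers back, using a hypothetical stand-alone `day` series rather than the full frame.

```python
import numpy as np
import pandas as pd

# Hypothetical 'day' column as it looks after reindexing (float with a NaN).
day = pd.Series([1.0, 2.0, 3.0, np.nan, 5.0], name='day')

# If the gaps are interpolated, the values can be cast back to plain integers.
day_filled = day.interpolate().round().astype(int)

# If the gaps are kept, the nullable 'Int64' dtype stores integers alongside <NA>.
day_nullable = day.astype('Int64')
```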