Question

I have a dataset spanning several years, but some values are missing. I want to fill those rows with "NAN". Here is a sample of the data:

year    month   day min
2011    1   1   -2.3
2011    1   2   -9.1
2011    1   3   -4.7
2011    1   4   -3.5
2011    1   6   -1.4
2011    1   7   0.1
2011    1   9   -6.3
2011    1   10  -9.4
2011    1   11  -13.3
2011    1   12  -17.9
2011    1   14  -11.8
2011    1   15  -11.2
2011    1   16  -7.1
2011    1   17  -7.6
2011    1   18  -9.9
2011    1   20  -6.9
2011    1   21  -8.8
2011    1   22  -11.3
2011    1   24  -3.1
2011    1   25  -0.7
2011    1   26  0.8
2011    1   27  -0.9
2011    1   28  -6.9
2011    1   29  -3.2
2011    1   30  -2.3
2011    1   31  -7

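As you can see, many values are missing in the first month of 2011. I need to insert a row for each missing day and fill it in. Is there a way to do this?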

Answer 1

You need to reindex with a MultiIndex created by MultiIndex.from_arrays from a date_range:

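import pandas as pd

# df is the question's DataFrame with columns year, month, day, min
start = '2011-01-01'
end = '2011-01-31'

# Every calendar day in the period, split into year/month/day levels
rng = pd.date_range(start, end)
mux = pd.MultiIndex.from_arrays([rng.year, rng.month, rng.day], names=('year','month','day'))

df = df.set_index(['year','month','day'])

# Days absent from df get a row with NaN in 'min'
print (df.reindex(mux).reset_index())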

    year  month  day   min
0   2011      1    1  -2.3
1   2011      1    2  -9.1
2   2011      1    3  -4.7
3   2011      1    4  -3.5
4   2011      1    5   NaN
5   2011      1    6  -1.4
6   2011      1    7   0.1
7   2011      1    8   NaN
8   2011      1    9  -6.3
9   2011      1   10  -9.4
10  2011      1   11 -13.3
11  2011      1   12 -17.9
12  2011      1   13   NaN
13  2011      1   14 -11.8
14  2011      1   15 -11.2
15  2011      1   16  -7.1
16  2011      1   17  -7.6
17  2011      1   18  -9.9
18  2011      1   19   NaN
19  2011      1   20  -6.9
20  2011      1   21  -8.8
21  2011      1   22 -11.3
22  2011      1   23   NaN
23  2011      1   24  -3.1
24  2011      1   25  -0.7
25  2011      1   26   0.8
26  2011      1   27  -0.9
27  2011      1   28  -6.9
28  2011      1   29  -3.2
29  2011      1   30  -2.3
30  2011      1   31  -7.0
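
For data covering several years, the start and end dates do not have to be typed in by hand. The following is a minimal sketch of the same reindex approach that derives the range from the data itself; the small frame below is a made-up sample in the question's layout, not the asker's real data.

```python
import pandas as pd

# Hypothetical sample in the question's layout; 2012-01-01 is missing
df = pd.DataFrame({
    "year":  [2011, 2011, 2011, 2012],
    "month": [12, 12, 12, 1],
    "day":   [29, 30, 31, 2],
    "min":   [-2.3, -9.1, -4.7, -3.5],
})

# Assemble real dates from the three columns to find the first and last day
dates = pd.to_datetime(df[["year", "month", "day"]])
rng = pd.date_range(dates.min(), dates.max(), freq="D")

# Same technique as above: a full calendar as a three-level MultiIndex
mux = pd.MultiIndex.from_arrays(
    [rng.year, rng.month, rng.day], names=["year", "month", "day"]
)

# Days that were absent get a row with NaN in 'min'
filled = df.set_index(["year", "month", "day"]).reindex(mux).reset_index()
print(filled)
```

Because the range comes from date_range, month lengths and leap years are handled without extra code.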

Answer 2

Convert the DataFrame to a time series with a datetime index, then use asfreq to change the frequency of the index to daily ('D'):

import pandas as pd

raw = """2011    1   1   -2.3
2011    1   2   -9.1
2011    1   3   -4.7
2011    1   4   -3.5
2011    1   6   -1.4"""

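# Parse the rows into dates and values
new_rows = []
for row in raw.split('\n'):
    fields = row.split()
    date = pd.to_datetime('/'.join(fields[:3]))
    # Take the last field as a number (row[-1] would be only the last character)
    value = float(fields[-1])
    new_rows.append({'date': date, 'value': value})

timeseries = pd.DataFrame(new_rows).set_index('date')
# asfreq returns a new frame with a NaN row for each missing day
timeseries = timeseries.asfreq('D')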
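
If the data is whitespace-separated text like the sample, the manual parsing loop can probably be skipped. This is a sketch, not part of the original answer: it assumes the column names year, month, day and min from the question's header row, and relies on pd.to_datetime accepting a frame with year/month/day columns.

```python
import io
import pandas as pd

raw = """2011    1   1   -2.3
2011    1   2   -9.1
2011    1   3   -4.7
2011    1   4   -3.5
2011    1   6   -1.4"""

# Column names are assumed from the question's header row
df = pd.read_csv(io.StringIO(raw), sep=r"\s+",
                 names=["year", "month", "day", "min"])

# Assemble a datetime index from the year/month/day columns
df.index = pd.to_datetime(df[["year", "month", "day"]])

# Daily frequency: 2011-01-05 appears as a new row with NaN
daily = df[["min"]].asfreq("D")
print(daily)
```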

Answer 3

I think df.replace() does the job:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', '  '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

# Anchored pattern: only cells that are entirely whitespace become NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

Output:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

Answer 4

Yes, use pandas.

Create a DataFrame with your dates as the index
Use asfreq

Hope this helps; see http://pandas.pydata.org/pandas-docs/stable/timeseries.html for more information :)
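
The two steps above can be sketched roughly as follows. The frame is a made-up three-row sample in the question's layout, and the final step that rebuilds the year/month/day columns is an optional extra, not something this answer spells out.

```python
import pandas as pd

# Hypothetical frame in the question's layout (2011-01-05 is missing)
df = pd.DataFrame({
    "year":  [2011, 2011, 2011],
    "month": [1, 1, 1],
    "day":   [4, 6, 7],
    "min":   [-3.5, -1.4, 0.1],
})

# Step 1: use the dates as the index
ts = df.set_index(pd.to_datetime(df[["year", "month", "day"]]))["min"]

# Step 2: asfreq inserts a NaN entry for every absent day
daily = ts.asfreq("D")

# Optional: return to separate year/month/day columns
out = pd.DataFrame({
    "year": daily.index.year,
    "month": daily.index.month,
    "day": daily.index.day,
    "min": daily.to_numpy(),
})
print(out)
```

Keeping the datetime index instead of the separate columns is usually more convenient if the data will be resampled or plotted later.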
