How do I convert time-interval data into time-series data using Python (and pandas)?
Here is my current DataFrame, where each row is a time interval:
code start_dt end_dt ent_value
156600 1960-01-01 2016-04-21 H:CXP
156600 1960-01-01 2016-01-03 46927
156600 1998-08-31 2016-01-03 5516751
156600 1960-01-01 1998-08-30 4501242
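For each combination of code and ent_value, we want one row in the frame for every day between that combination's start and end dates (i.e., a time series):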
code as_of_dt ent_value
156600 1960-01-01 H:CXP
156600 1960-01-02 H:CXP
156600 1960-01-03 H:CXP
156600 1960-01-01 46927
156600 1960-01-02 46927
156600 1960-01-03 46927
156600 1960-01-01 5516751
156600 1960-01-02 5516751
156600 1960-01-03 5516751
...
156600 2016-01-01 H:CXP
156600 2016-01-02 H:CXP
156600 2016-01-03 H:CXP
156600 2016-01-01 46927
156600 2016-01-02 46927
156600 2016-01-03 46927
156600 2016-01-01 5516751
156600 2016-01-02 5516751
156600 2016-01-03 5516751
How can I do this efficiently?
Answer 0 (score: 1)
First, try it yourself! Then, if you don't succeed, here is a possible solution.
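import pandas as pd

data = pd.read_csv('/tmp/test.tab', sep='\t')
# One (code, daily date range, ent_value) tuple per input row
tmp = [(e.code, pd.date_range(e.start_dt, e.end_dt, freq='1D'),
        e.ent_value) for e in data.itertuples()]
# Loop over the rows first, then over each row's dates
res = [(line[0], date, line[2]) for line in tmp for date in line[1]]
df = pd.DataFrame(res, columns=['code', 'as_of_dt', 'ent_value'])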
The function pd.date_range() is used to create each date range.
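As a quick illustration of what pd.date_range() returns, here is a minimal sketch using dates I picked for the example (they are not from the question's file). Note that both endpoints are included, which is what you want when end_dt is the last day of the interval.

```python
import pandas as pd

# Both endpoints are included, so a 4-day interval yields 4 timestamps.
days = pd.date_range("1960-01-01", "1960-01-04", freq="D")
print(days)
```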
Answer 1 (score: 0)
Try this:
In [17]: %paste
(df.groupby(['code','ent_value'])
.apply(lambda x: pd.DataFrame({'as_of_dt':pd.date_range(x.start_dt.min(), x.end_dt.max())}))
.reset_index()
.drop('level_2', axis=1)
)
## -- End pasted text --
Out[17]:
code ent_value as_of_dt
0 156600 4501242 1960-01-01
1 156600 4501242 1960-01-02
2 156600 4501242 1960-01-03
3 156600 4501242 1960-01-04
4 156600 4501242 1960-01-05
5 156600 4501242 1960-01-06
6 156600 4501242 1960-01-07
7 156600 4501242 1960-01-08
8 156600 4501242 1960-01-09
9 156600 4501242 1960-01-10
10 156600 4501242 1960-01-11
11 156600 4501242 1960-01-12
12 156600 4501242 1960-01-13
13 156600 4501242 1960-01-14
14 156600 4501242 1960-01-15
15 156600 4501242 1960-01-16
16 156600 4501242 1960-01-17
17 156600 4501242 1960-01-18
18 156600 4501242 1960-01-19
19 156600 4501242 1960-01-20
20 156600 4501242 1960-01-21
21 156600 4501242 1960-01-22
22 156600 4501242 1960-01-23
23 156600 4501242 1960-01-24
24 156600 4501242 1960-01-25
25 156600 4501242 1960-01-26
26 156600 4501242 1960-01-27
27 156600 4501242 1960-01-28
28 156600 4501242 1960-01-29
29 156600 4501242 1960-01-30
... ... ... ...
61450 156600 H:CXP 2016-03-23
61451 156600 H:CXP 2016-03-24
61452 156600 H:CXP 2016-03-25
61453 156600 H:CXP 2016-03-26
61454 156600 H:CXP 2016-03-27
61455 156600 H:CXP 2016-03-28
61456 156600 H:CXP 2016-03-29
61457 156600 H:CXP 2016-03-30
61458 156600 H:CXP 2016-03-31
61459 156600 H:CXP 2016-04-01
61460 156600 H:CXP 2016-04-02
61461 156600 H:CXP 2016-04-03
61462 156600 H:CXP 2016-04-04
61463 156600 H:CXP 2016-04-05
61464 156600 H:CXP 2016-04-06
61465 156600 H:CXP 2016-04-07
61466 156600 H:CXP 2016-04-08
61467 156600 H:CXP 2016-04-09
61468 156600 H:CXP 2016-04-10
61469 156600 H:CXP 2016-04-11
61470 156600 H:CXP 2016-04-12
61471 156600 H:CXP 2016-04-13
61472 156600 H:CXP 2016-04-14
61473 156600 H:CXP 2016-04-15
61474 156600 H:CXP 2016-04-16
61475 156600 H:CXP 2016-04-17
61476 156600 H:CXP 2016-04-18
61477 156600 H:CXP 2016-04-19
61478 156600 H:CXP 2016-04-20
61479 156600 H:CXP 2016-04-21
[61480 rows x 3 columns]
Test with a DF that has smaller date ranges:
In [19]: df
Out[19]:
code start_dt end_dt ent_value
0 156600 1960-01-01 1960-01-04 H:CXP
1 156600 1960-01-04 1960-01-09 46927
2 156600 1998-08-31 1998-09-04 5516751
3 156600 1965-01-01 1965-01-04 4501242
In [20]: (df.groupby(['code','ent_value'])
....: .apply(lambda x: pd.DataFrame({'as_of_dt':pd.date_range(x.start_dt.min(), x.end_dt.max())}))
....: .reset_index()
....: .drop('level_2', axis=1)
....: )
Out[20]:
code ent_value as_of_dt
0 156600 4501242 1965-01-01
1 156600 4501242 1965-01-02
2 156600 4501242 1965-01-03
3 156600 4501242 1965-01-04
4 156600 46927 1960-01-04
5 156600 46927 1960-01-05
6 156600 46927 1960-01-06
7 156600 46927 1960-01-07
8 156600 46927 1960-01-08
9 156600 46927 1960-01-09
10 156600 5516751 1998-08-31
11 156600 5516751 1998-09-01
12 156600 5516751 1998-09-02
13 156600 5516751 1998-09-03
14 156600 5516751 1998-09-04
15 156600 H:CXP 1960-01-01
16 156600 H:CXP 1960-01-02
17 156600 H:CXP 1960-01-03
18 156600 H:CXP 1960-01-04
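One thing to keep in mind with the groupby approach: it expands from the earliest start_dt to the latest end_dt within each group, so if the same (code, ent_value) pair appears in two separate intervals, the days in between are filled in as well. If that is not what you want, a row-by-row expansion with DataFrame.explode (pandas 0.25 or later) is one possible alternative. This is only a sketch; the two-row frame below is hypothetical data I made up to show the gap.

```python
import pandas as pd

# Hypothetical data: the same (code, ent_value) pair appears twice,
# with a gap between the two intervals.
df = pd.DataFrame({
    "code": [156600, 156600],
    "start_dt": pd.to_datetime(["1960-01-01", "1960-01-10"]),
    "end_dt": pd.to_datetime(["1960-01-02", "1960-01-12"]),
    "ent_value": ["46927", "46927"],
})

# Build one list of daily timestamps per row (both endpoints included).
days_per_row = [
    pd.date_range(start, end, freq="D").tolist()
    for start, end in zip(df["start_dt"], df["end_dt"])
]

expanded = (
    df.assign(as_of_dt=days_per_row)
      .explode("as_of_dt")                  # one output row per day
      .drop(columns=["start_dt", "end_dt"])
      .reset_index(drop=True)
)
# explode leaves an object column; convert it back to datetime64.
expanded["as_of_dt"] = pd.to_datetime(expanded["as_of_dt"])
expanded = expanded[["code", "as_of_dt", "ent_value"]]
print(expanded)
```

Here the days between 1960-01-03 and 1960-01-09 do not show up in the result, because each row is expanded on its own.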
Answer 2 (score: 0)
Suppose you have the following DataFrame named df (see below to recreate this example):
id starttime endtime flag
0 A 2020-03-18 2020-03-20 y
1 B 2020-03-20 2020-03-23 n
2 C 2020-03-19 2020-03-21 y
Then you can create a new DataFrame by iterating over all the rows with the help of date_range:
new_df = pd.DataFrame(
    data=((row.id, row.flag, date)
          # iterate over rows
          for row in df.itertuples()
          # expand the range into 1-day intervals
          for date in pd.date_range(row.starttime, row.endtime, freq='1D')),
    columns=['name', 'flag', 'interval'])
You will end up with this:
name flag interval
0 A y 2020-03-18
1 A y 2020-03-19
2 A y 2020-03-20
3 B n 2020-03-20
4 B n 2020-03-21
5 B n 2020-03-22
6 B n 2020-03-23
7 C y 2020-03-19
8 C y 2020-03-20
9 C y 2020-03-21
To recreate the example DataFrame:
import pandas as pd
df = pd.DataFrame({
'id': ['A', 'B', 'C'],
'starttime': ['2020-03-18', '2020-03-20','2020-03-19' ],
'endtime': ['2020-03-20', '2020-03-23','2020-03-21'],
'flag': ['y','n','y']
})
df['starttime'] = pd.to_datetime(df['starttime'])
df['endtime'] = pd.to_datetime(df['endtime'])
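Since the question asks about efficiency, here is a sketch of a variant that avoids the Python-level loop over rows by repeating each row once per day and then adding a day offset. It reuses the example frame above; the intermediate names (n_days, repeated, offset) are my own, and I have not benchmarked it, so treat any speed-up on large frames as an assumption to verify on your data.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "B", "C"],
    "starttime": pd.to_datetime(["2020-03-18", "2020-03-20", "2020-03-19"]),
    "endtime": pd.to_datetime(["2020-03-20", "2020-03-23", "2020-03-21"]),
    "flag": ["y", "n", "y"],
})

# Number of days covered by each interval, counting both endpoints.
n_days = (df["endtime"] - df["starttime"]).dt.days + 1

# Repeat each row once per day it covers.
repeated = df.loc[df.index.repeat(n_days)]

# Position of each repeated row within its original row: 0, 1, 2, ...
offset = repeated.groupby(level=0).cumcount().to_numpy()

new_df = repeated.reset_index(drop=True)
new_df["interval"] = new_df["starttime"] + pd.to_timedelta(offset, unit="D")
new_df = new_df[["id", "flag", "interval"]].rename(columns={"id": "name"})
print(new_df)
```

This should give the same ten rows as the itertuples version shown earlier in this answer.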