我有两个不同的pandas DataFrames,我想在其他DataFrame同时具有特定值时从一个DataFrame中提取数据。具体来说,我有一个名为" GDP"看起来如下:
GDP
DATE
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
我还有一个名为"经济衰退"的数据框架。其中包含以下数据:
USRECQ
DATE
1949-07-01 1
1949-10-01 1
1950-01-01 0
我想创建两个新的时间序列。每当USRECQ在同一日期的值为0时,应包含GDP数据。每当USRECQ在同一日期的值为1时,另一个应该包含GDP数据。我怎样才能做到这一点?
答案 0 :(得分:4)
让我们修改您发布的示例,以便日期重叠:
import pandas as pd
import numpy as np
GDP = pd.DataFrame({'GDP':np.arange(10)*10},
index=pd.date_range('2000-1-1', periods=10, freq='D'))
# GDP
# 2000-01-01 0
# 2000-01-02 10
# 2000-01-03 20
# 2000-01-04 30
# 2000-01-05 40
# 2000-01-06 50
# 2000-01-07 60
# 2000-01-08 70
# 2000-01-09 80
# 2000-01-10 90
recession = pd.DataFrame({'USRECQ': [0]*5+[1]*5},
index=pd.date_range('2000-1-2', periods=10, freq='D'))
# USRECQ
# 2000-01-02 0
# 2000-01-03 0
# 2000-01-04 0
# 2000-01-05 0
# 2000-01-06 0
# 2000-01-07 1
# 2000-01-08 1
# 2000-01-09 1
# 2000-01-10 1
# 2000-01-11 1
然后你可以加入两个数据帧:
combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
# GDP USRECQ
# 2000-01-01 0 NaN
# 2000-01-02 10 0
# 2000-01-03 20 0
# 2000-01-04 30 0
# 2000-01-05 40 0
# 2000-01-06 50 0
# 2000-01-07 60 1
# 2000-01-08 70 1
# 2000-01-09 80 1
# 2000-01-10 90 1
# 2000-01-11 NaN 1
并根据以下条件选择行:
In [112]: combined.loc[combined['USRECQ']==0]
Out[112]:
GDP USRECQ
2000-01-02 10 0
2000-01-03 20 0
2000-01-04 30 0
2000-01-05 40 0
2000-01-06 50 0
In [113]: combined.loc[combined['USRECQ']==1]
Out[113]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
要获得GDP列,请将列名称作为combined.loc
的第二项:
In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]:
2000-01-07 60
2000-01-08 70
2000-01-09 80
2000-01-10 90
2000-01-11 NaN
Freq: D, Name: GDP, dtype: float64
正如PaulH所指出的那样,你也可以使用query
,它有更好的语法:
In [118]: combined.query('USRECQ==1')
Out[118]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1