How can I get the rows of a DataFrame that fall within the ranges given in another DataFrame? For example:
import pandas as pd
df1 = pd.DataFrame({
'date': [
pd.Timestamp(2019,1,1),
pd.Timestamp(2019,1,2),
pd.Timestamp(2019,1,3),
pd.Timestamp(2019,2,1),
pd.Timestamp(2019,2,5)
]
})
df2 = pd.DataFrame({
'from_date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)],
'to_date': [pd.Timestamp(2019,1,2), pd.Timestamp(2019,2,1)]
})
Data:
> df1
date
0 2019-01-01 <- I want this
1 2019-01-02 <- and this
2 2019-01-03
3 2019-02-01 <- and this
4 2019-02-05
> df2
from_date to_date
0 2019-01-01 2019-01-02
1 2019-02-01 2019-02-01
The ranges may overlap one another. I want to find all rows in df1 that fall within any of the ranges in df2. I tried:
df1[df1['date'].between(df2['from_date'], df2['to_date'])]
But this raises an error:
ValueError: Can only compare identically-labeled Series objects
Answer 0 (score: 2)
I would use numpy broadcasting:
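import numpy as np
# Range bounds as 1-D arrays of shape (m,)
s2_1 = df2.from_date.values
s2_2 = df2.to_date.values
# Dates as a column of shape (n, 1) so the comparison broadcasts to (n, m)
s1 = df1['date'].values[:, None]
# Keep a row if it falls inside at least one range
df1[np.any((s1 >= s2_1) & (s1 <= s2_2), -1)]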
Out[35]:
date
0 2019-01-01
1 2019-01-02
3 2019-02-01
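If the n-by-m intermediate array built by broadcasting is a concern, one possible alternative is to loop over the ranges and OR together one mask per range. The sketch below assumes the df1 and df2 from the question and relies on Series.between accepting scalar bounds (inclusive on both ends by default). The names mask and result are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'date': [
        pd.Timestamp(2019, 1, 1),
        pd.Timestamp(2019, 1, 2),
        pd.Timestamp(2019, 1, 3),
        pd.Timestamp(2019, 2, 1),
        pd.Timestamp(2019, 2, 5),
    ]
})
df2 = pd.DataFrame({
    'from_date': [pd.Timestamp(2019, 1, 1), pd.Timestamp(2019, 2, 1)],
    'to_date': [pd.Timestamp(2019, 1, 2), pd.Timestamp(2019, 2, 1)],
})

# Start with "no row matches" and switch rows on, one range at a time.
mask = np.zeros(len(df1), dtype=bool)
for start, end in zip(df2['from_date'], df2['to_date']):
    # between() takes scalar bounds here, so no label alignment is involved.
    mask |= df1['date'].between(start, end).to_numpy()

result = df1[mask]
print(result)
```

This trades the single vectorized comparison for a Python-level loop over the ranges, so it is likely to suit cases with many dates and comparatively few ranges.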
Answer 1 (score: 2)
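Here is another approach:
1) Build an array of dates using a list comprehension with numpy.hstack and pandas.date_range.
2) Apply Series.isin to df1 with this array and use the result for boolean indexing.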
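import numpy as np
# Step 1: expand every range into its daily dates and stack them into one array
dates = np.hstack([pd.date_range(s, e) for s, e in zip(df2['from_date'], df2['to_date'])])
# Step 2: keep the rows whose date appears in that array
df1[df1.date.isin(dates)]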
date
0 2019-01-01
1 2019-01-02
3 2019-02-01
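One caveat with expanding ranges via pandas.date_range is that it yields one stamp per day starting from the range start, so a timestamp with a time-of-day component inside a range would not be matched by isin. The sketch below uses hypothetical data (not from the question) to contrast this with a direct comparison against the bounds. The names via_isin and via_bounds are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first timestamp falls at noon, inside the range.
df1 = pd.DataFrame({
    'date': [
        pd.Timestamp(2019, 1, 1, 12),
        pd.Timestamp(2019, 1, 2),
        pd.Timestamp(2019, 1, 3),
    ]
})
df2 = pd.DataFrame({
    'from_date': [pd.Timestamp(2019, 1, 1)],
    'to_date': [pd.Timestamp(2019, 1, 2)],
})

# Daily expansion only contains the midnight stamps of each day in the range.
dates = np.hstack([pd.date_range(s, e) for s, e in zip(df2['from_date'], df2['to_date'])])
via_isin = df1['date'].isin(dates)

# Comparing against the bounds directly also keeps the noon timestamp.
d = df1['date'].values[:, None]
via_bounds = ((d >= df2['from_date'].values) & (d <= df2['to_date'].values)).any(axis=1)

print(via_isin.tolist())    # [False, True, False]
print(via_bounds.tolist())  # [True, True, False]
```

For date-only data such as the question's, both approaches select the same rows.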
Answer 2 (score: 2)
Another approach, not recommended for large DataFrames, is to build the Cartesian product and filter the result:
import pandas as pd
df1 = pd.DataFrame({
'date': [
pd.Timestamp(2019,1,1),
pd.Timestamp(2019,1,2),
pd.Timestamp(2019,1,3),
pd.Timestamp(2019,2,1),
pd.Timestamp(2019,2,5)
]
})
df2 = pd.DataFrame({
'from_date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)],
'to_date': [pd.Timestamp(2019,1,2), pd.Timestamp(2019,2,1)]
})
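# Ensure every column is datetime (a no-op for the Timestamp data above)
df1 = df1.apply(pd.to_datetime)
df2 = df2.apply(pd.to_datetime)
# Cross join on a constant key, then keep the pairs where the date lies inside the range
df_out = df1.assign(key=1).merge(df2.assign(key=1))\
.query('from_date <= date <= to_date')
df_out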
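Since the question allows overlapping ranges, a date that lies in several ranges would appear once per matching range in the Cartesian product. A minimal sketch of one way to handle this follows, assuming pandas 1.2 or later (for merge with how='cross') and a hypothetical df2 with overlapping ranges. The names pairs, hits and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': [
        pd.Timestamp(2019, 1, 1),
        pd.Timestamp(2019, 1, 2),
        pd.Timestamp(2019, 1, 3),
        pd.Timestamp(2019, 2, 1),
        pd.Timestamp(2019, 2, 5),
    ]
})
# Hypothetical ranges: the first two both contain 2019-01-02.
df2 = pd.DataFrame({
    'from_date': [pd.Timestamp(2019, 1, 1), pd.Timestamp(2019, 1, 2), pd.Timestamp(2019, 2, 1)],
    'to_date': [pd.Timestamp(2019, 1, 2), pd.Timestamp(2019, 1, 2), pd.Timestamp(2019, 2, 1)],
})

# reset_index() keeps the original row labels in an 'index' column,
# because the cross join itself produces a fresh 0..n*m-1 index.
pairs = df1.reset_index().merge(df2, how='cross')

# Keep the (date, range) pairs where the date lies inside the range.
hits = pairs[(pairs['from_date'] <= pairs['date']) & (pairs['date'] <= pairs['to_date'])]

# Collapse repeated matches back to unique rows of df1.
result = df1.loc[hits['index'].unique()]
print(result)
```

Here hits has four rows because 2019-01-02 matches two ranges, while result contains each matching row of df1 once.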