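Select rows that fall within ranges defined in another DataFrame

Time: 2019-04-10 14:40:27

Tags: python pandas dataframe

How can I get the rows of a DataFrame that fall within the ranges defined in another DataFrame? For example: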

import pandas as pd

df1 = pd.DataFrame({
    'date': [
        pd.Timestamp(2019,1,1),
        pd.Timestamp(2019,1,2),
        pd.Timestamp(2019,1,3),
        pd.Timestamp(2019,2,1),
        pd.Timestamp(2019,2,5)
    ]
})

df2 = pd.DataFrame({
    'from_date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)],
    'to_date': [pd.Timestamp(2019,1,2), pd.Timestamp(2019,2,1)]
})

Data:

> df1
    date
0   2019-01-01   <- I want this
1   2019-01-02   <- and this
2   2019-01-03   
3   2019-02-01   <- and this
4   2019-02-05

> df2
    from_date   to_date
0   2019-01-01  2019-01-02
1   2019-02-01  2019-02-01

The ranges may overlap each other. I want to find every row in df1 that falls within any of the ranges in df2. I tried:

df1[df1['date'].between(df2['from_date'], df2['to_date'])]

But this raises an error:

ValueError: Can only compare identically-labeled Series objects

3 answers:

Answer 0: (score: 2)

I would use numpy broadcasting:

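import numpy as np

s2_1 = df2.from_date.values         # range starts, shape (m,)
s2_2 = df2.to_date.values           # range ends, shape (m,)
s1 = df1['date'].values[:, None]    # dates as a column, shape (n, 1)
# compare every date with every range; keep rows that match at least one
df1[np.any((s1 >= s2_1) & (s1 <= s2_2), -1)]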
Out[35]: 
        date
0 2019-01-01
1 2019-01-02
3 2019-02-01
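
The broadcasting idea above can be wrapped in a small function. This is only a sketch: `in_any_range` is a hypothetical helper name, not part of pandas or numpy, and the frames are rebuilt here so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(
    ['2019-01-01', '2019-01-02', '2019-01-03', '2019-02-01', '2019-02-05'])})
df2 = pd.DataFrame({
    'from_date': pd.to_datetime(['2019-01-01', '2019-02-01']),
    'to_date': pd.to_datetime(['2019-01-02', '2019-02-01']),
})

def in_any_range(dates, starts, ends):
    """Hypothetical helper: True where a date lies in at least one [start, end]."""
    d = dates.to_numpy()[:, None]      # shape (n, 1)
    lo = starts.to_numpy()[None, :]    # shape (1, m)
    hi = ends.to_numpy()[None, :]      # shape (1, m)
    # Broadcasting yields an (n, m) boolean matrix; reduce over the ranges.
    return ((d >= lo) & (d <= hi)).any(axis=1)

mask = in_any_range(df1['date'], df2['from_date'], df2['to_date'])
result = df1[mask]
print(result)
```

The intermediate boolean matrix has one cell per (date, range) pair, so memory grows with len(df1) * len(df2).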

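Answer 1: (score: 2)

Here is another approach:

1) Use a list comprehension, numpy.hstack and pandas.date_range to build an array of dates.

2) Use this array of dates with Series.isin and boolean indexing on df1.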

# Step 1
dates = np.hstack([pd.date_range(s, e) for s, e in zip(df2['from_date'], df2['to_date'])])

# Step 2
df1[df1.date.isin(dates)]

        date
0 2019-01-01
1 2019-01-02
3 2019-02-01
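
One thing to keep in mind with this approach: pd.date_range steps one day at a time by default, so the isin test only matches timestamps that sit exactly on those daily steps. The sketch below uses made-up timestamps with a time-of-day component and assumes the ranges are meant to cover whole days; under that assumption, normalizing to midnight first gives the expected matches.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): timestamps with a time of day.
df1 = pd.DataFrame({'date': pd.to_datetime(
    ['2019-01-01 08:30', '2019-01-03 12:00', '2019-02-01 23:59'])})
df2 = pd.DataFrame({
    'from_date': pd.to_datetime(['2019-01-01', '2019-02-01']),
    'to_date': pd.to_datetime(['2019-01-02', '2019-02-01']),
})

# Expand every range into its individual days.
dates = np.hstack([pd.date_range(s, e)
                   for s, e in zip(df2['from_date'], df2['to_date'])])

# Exact matching finds nothing, because no timestamp is at midnight.
exact = df1[df1['date'].isin(dates)]

# Assumption: ranges cover whole days, so compare on the normalized date.
by_day = df1[df1['date'].dt.normalize().isin(dates)]
print(by_day)
```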

Answer 2: (score: 2)

Another approach, not recommended for large DataFrames, is to build the Cartesian product and filter the result:

import pandas as pd

df1 = pd.DataFrame({
    'date': [
        pd.Timestamp(2019,1,1),
        pd.Timestamp(2019,1,2),
        pd.Timestamp(2019,1,3),
        pd.Timestamp(2019,2,1),
        pd.Timestamp(2019,2,5)
    ]
})

df2 = pd.DataFrame({
    'from_date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)],
    'to_date': [pd.Timestamp(2019,1,2), pd.Timestamp(2019,2,1)]
})

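# Make sure every column is datetime (a no-op for the frames built above)
df1 = df1.apply(pd.to_datetime)

df2 = df2.apply(pd.to_datetime)

# Cross join on a constant key, then keep the pairs where the date is in range
df_out = df1.assign(key=1).merge(df2.assign(key=1))\
            .query('from_date <= date <= to_date')

df_out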
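
As a variation on the Cartesian product above: pandas 1.2 and later accept how='cross' in merge, which removes the need for the helper key column. The sketch below assumes such a version and uses an illustrative df2 with two overlapping ranges (not the question's data) to show that a date can match more than once, so the final selection goes back to df1 to avoid repeated rows.

```python
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(
    ['2019-01-01', '2019-01-02', '2019-01-03', '2019-02-01', '2019-02-05'])})
# Illustrative ranges: the first two overlap on 2019-01-02.
df2 = pd.DataFrame({
    'from_date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-02-01']),
    'to_date': pd.to_datetime(['2019-01-02', '2019-01-03', '2019-02-01']),
})

# Every (date, range) pair; requires pandas >= 1.2 for how='cross'.
pairs = df1.merge(df2, how='cross')

# Keep the pairs where the date lies inside the range (bounds inclusive).
hits = pairs[pairs['date'].between(pairs['from_date'], pairs['to_date'])]

# 2019-01-02 appears twice in hits; select from df1 to get each row once.
result = df1[df1['date'].isin(hits['date'])]
print(result)
```

The product still has len(df1) * len(df2) rows, so the warning about large DataFrames applies here as well.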