Python pandas: drop rows in one DataFrame based on another DataFrame

Time: 2017-05-17 01:31:52

Tags: python pandas dataframe

I have two DataFrames, shown below. For each userid, I want to drop every row from PersonDate_List whose datetime is earlier than min('datetime') for the same userid in df_userid_date.

PersonDate_List (cols={'userid','datetime','Score'})
userid  datetime    Score
AB-4243 2/1/2016    0
AB-4243 2/2/2016    0
AB-4243 2/3/2016    0
AB-4243 2/4/2016    0
AB-4243 2/5/2016    0
AB-4243 2/6/2016    76
AB-4243 2/7/2016    84
AB-4243 2/8/2016    84
AB-4243 2/9/2016    81
AB-4243 2/10/2016   79
NP-7585 2/1/2016    22
NP-7585 2/2/2016    23.5
NP-7585 2/3/2016    30.15
NP-7585 2/4/2016    30.15
NP-7585 2/5/2016    30.15
NP-7585 2/6/2016    30.15
NP-7585 2/7/2016    0
NP-7585 2/8/2016    0
NP-7585 2/9/2016    22.5
NP-7585 2/10/2016   45.67
VX-4376 2/1/2016    0
VX-4376 2/2/2016    0
VX-4376 2/3/2016    0
VX-4376 2/4/2016    0
VX-4376 2/5/2016    0
VX-4376 2/6/2016    0
VX-4376 2/7/2016    0
VX-4376 2/8/2016    0
VX-4376 2/9/2016    0
VX-4376 2/10/2016   33.13

df_userid_date (cols={'userid','datetime'})
userid  datetime
AB-4243 2/6/2016
AB-4243 2/7/2016
AB-4243 2/9/2016
AB-4243 2/10/2016
NP-7585 2/1/2016
NP-7585 2/2/2016
NP-7585 2/3/2016
NP-7585 2/7/2016
NP-7585 2/8/2016
NP-7585 2/9/2016
NP-7585 2/10/2016
VX-4376 2/10/2016

This is the result I am looking for:

userid  datetime    Score
AB-4243 2/6/2016    76
AB-4243 2/7/2016    84
AB-4243 2/8/2016    84
AB-4243 2/9/2016    81
AB-4243 2/10/2016   79
NP-7585 2/1/2016    22
NP-7585 2/2/2016    23.5
NP-7585 2/3/2016    30.15
NP-7585 2/4/2016    30.15
NP-7585 2/5/2016    30.15
NP-7585 2/6/2016    30.15
NP-7585 2/7/2016    0
NP-7585 2/8/2016    0
NP-7585 2/9/2016    22.5
NP-7585 2/10/2016   45.67
VX-4376 2/10/2016   33.13

I tried adding a minimum-date flag to df_userid_date and then merging, but I could not work out the filter condition from there.

2 answers:

Answer 0 (score: 2)

Try it this way:

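import pandas as pd

# Load both frames and parse the date columns
df1 = pd.read_csv('PersonDate.csv')
df2 = pd.read_csv('useriddate.csv')
df1['datetime'] = pd.to_datetime(df1['datetime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])
# Attach each user's earliest date from df2 (it arrives as datetime_y),
# then keep only the rows dated on or after it
df3 = df1.merge(df2.groupby('userid', as_index=False).agg({'datetime': 'min'}), on='userid')
df3[df3["datetime_x"] >= df3["datetime_y"]]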

Output:

    userid  datetime_x  Score   datetime_y
5   AB-4243 2016-02-06  76.00   2016-02-06
6   AB-4243 2016-02-07  84.00   2016-02-06
7   AB-4243 2016-02-08  84.00   2016-02-06
8   AB-4243 2016-02-09  81.00   2016-02-06
9   AB-4243 2016-02-10  79.00   2016-02-06
10  NP-7585 2016-02-01  22.00   2016-02-01
11  NP-7585 2016-02-02  23.50   2016-02-01
12  NP-7585 2016-02-03  30.15   2016-02-01
13  NP-7585 2016-02-04  30.15   2016-02-01
14  NP-7585 2016-02-05  30.15   2016-02-01
15  NP-7585 2016-02-06  30.15   2016-02-01
16  NP-7585 2016-02-07  0.00    2016-02-01
17  NP-7585 2016-02-08  0.00    2016-02-01
18  NP-7585 2016-02-09  22.50   2016-02-01
19  NP-7585 2016-02-10  45.67   2016-02-01
29  VX-4376 2016-02-10  33.13   2016-02-10
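
To get exactly the layout asked for in the question, the helper column can be dropped after filtering. The following is a minimal sketch rather than part of the original answer: it assumes small inline frames in place of the CSV files (a shortened subset of the question's data), and the names min_dates, merged, and result are illustrative.

```python
import pandas as pd

# Shortened subset of the question's data, built inline instead of read from CSV (assumption).
df1 = pd.DataFrame({
    "userid": ["AB-4243"] * 3 + ["VX-4376"] * 2,
    "datetime": ["2/5/2016", "2/6/2016", "2/7/2016", "2/9/2016", "2/10/2016"],
    "Score": [0, 76, 84, 0, 33.13],
})
df2 = pd.DataFrame({
    "userid": ["AB-4243", "AB-4243", "VX-4376"],
    "datetime": ["2/6/2016", "2/7/2016", "2/10/2016"],
})
df1["datetime"] = pd.to_datetime(df1["datetime"], format="%m/%d/%Y")
df2["datetime"] = pd.to_datetime(df2["datetime"], format="%m/%d/%Y")

# Earliest date per user in df2.
min_dates = df2.groupby("userid", as_index=False)["datetime"].min()

# Merge; the left column keeps its name and the right one becomes datetime_min.
merged = df1.merge(min_dates, on="userid", suffixes=("", "_min"))

# Keep rows on or after the user's earliest date, then drop the helper column.
result = (
    merged[merged["datetime"] >= merged["datetime_min"]]
    .drop(columns="datetime_min")
    .reset_index(drop=True)
)
print(result)
```

Passing suffixes=("", "_min") keeps the original datetime column name, so no rename is needed afterwards.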

Answer 1 (score: 0)

I am fairly sure there is a more concise way to write this, but if no other answers come in, you can use the following:

import pandas as pd
import datetime

#Read data
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

#Format Datetime column
df1['datetime'] = df1['datetime'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y'))
df2['datetime'] = df2['datetime'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y'))

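#Get min datetime in df2 for each id
#Quick check for a single user (the result is only displayed, not stored)
min(list(df2[df2['userid']=='AB-4243']['datetime']))
#One row per distinct userid, with that user's earliest date in df2
temp = pd.DataFrame(list(set(df2['userid'])))
temp.columns = ['userid']
temp['min_datetime'] = temp['userid'].apply(lambda x: min(list(df2[df2['userid']==x]['datetime'])))
temp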

#Merge in
df1 = df1.merge(temp, on='userid')

#Slicing
result = df1[df1['datetime'] >= df1['min_datetime']]
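
For comparison, the per-user minimum can also be looked up with groupby and Series.map, which avoids both the helper frame and the merge. This is a minimal sketch, not the code from the answer above: the inline data is a shortened subset of the question's tables, and the user ZZ-0000 is hypothetical, added only to show what happens to a userid that is missing from df2.

```python
import pandas as pd

# Inline stand-ins for the two CSV files (assumption); ZZ-0000 is a made-up user absent from df2.
df1 = pd.DataFrame({
    "userid": ["AB-4243", "AB-4243", "AB-4243", "VX-4376", "VX-4376", "ZZ-0000"],
    "datetime": ["2/5/2016", "2/6/2016", "2/7/2016", "2/9/2016", "2/10/2016", "2/1/2016"],
    "Score": [0, 76, 84, 0, 33.13, 5],
})
df2 = pd.DataFrame({
    "userid": ["AB-4243", "AB-4243", "VX-4376"],
    "datetime": ["2/6/2016", "2/7/2016", "2/10/2016"],
})
df1["datetime"] = pd.to_datetime(df1["datetime"], format="%m/%d/%Y")
df2["datetime"] = pd.to_datetime(df2["datetime"], format="%m/%d/%Y")

# Series indexed by userid holding each user's earliest date in df2.
min_by_user = df2.groupby("userid")["datetime"].min()

# Look up the threshold for every df1 row; users absent from df2 get NaT.
threshold = df1["userid"].map(min_by_user)

# Comparisons against NaT evaluate to False, so users missing from df2
# are dropped here, which matches the behavior of the inner merge.
filtered = df1[df1["datetime"] >= threshold]
print(filtered)
```

If users missing from df2 should be kept instead, the mask could be combined with threshold.isna() using the | operator.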