Question

I'm trying to group the data in one column by the data in another column, but I only want data from a specific time range, say 2015-11-1 to 2016-4-30. My database looks like this:

account_id    employer_key    login_date
1111111       google          2016-03-03 20:58:36.000000
2222222       walmart         2015-11-18 11:52:56.000000
2222222       walmart         2015-11-18 11:53:14.000000
1111111       google          2016-04-06 23:29:04.000000
3333333       dell_inc        2015-09-05 14:13:53.000000
3333333       dell_inc        2016-01-28 03:20:58.000000
2222222       walmart         2015-09-03 00:11:38.000000
1111111       google          2015-09-03 00:12:25.000000
1111111       google          2015-11-13 01:59:59.000000
4444444       google          2015-11-13 01:59:59.000000
5555555       dell_inc        2015-03-12 01:59:59.000000
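
For anyone who wants to reproduce this, the sample above could be loaded into a DataFrame roughly as follows. This is only a sketch: the name df matches what the answers below use, and parsing login_date with pd.to_datetime is an assumption, since the column's dtype isn't stated here.

```python
import pandas as pd

# Rows copied from the sample table above (fractional seconds are all zero).
rows = [
    (1111111, "google",   "2016-03-03 20:58:36"),
    (2222222, "walmart",  "2015-11-18 11:52:56"),
    (2222222, "walmart",  "2015-11-18 11:53:14"),
    (1111111, "google",   "2016-04-06 23:29:04"),
    (3333333, "dell_inc", "2015-09-05 14:13:53"),
    (3333333, "dell_inc", "2016-01-28 03:20:58"),
    (2222222, "walmart",  "2015-09-03 00:11:38"),
    (1111111, "google",   "2015-09-03 00:12:25"),
    (1111111, "google",   "2015-11-13 01:59:59"),
    (4444444, "google",   "2015-11-13 01:59:59"),
    (5555555, "dell_inc", "2015-03-12 01:59:59"),
]
df = pd.DataFrame(rows, columns=["account_id", "employer_key", "login_date"])

# Assumption: treat login_date as real timestamps rather than strings.
df["login_date"] = pd.to_datetime(df["login_date"])
```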

I'm trying to get output that looks like this (showing 1 or True if the person logged in within that time window, and 0 or False if they did not):

employer_key  account_id   login_date
google        1111111       1
              4444444       1
walmart       2222222       1
dell_inc      3333333       1
dell_inc      5555555       0

How can I do this?

Answer 1

You can do it this way:

In [252]: df.groupby(['employer_key','account_id']) \
     ...:   .apply(lambda x: len(x.query("'2015-11-01' <= login_date <= '2016-04-30'")) > 0) \
     ...:   .reset_index()
Out[252]:
  employer_key  account_id      0
0     dell_inc     3333333   True
1     dell_inc     5555555  False
2       google     1111111   True
3       google     4444444   True
4      walmart     2222222   True

Or use boolean indexing:

In [249]: df.groupby(['employer_key','account_id'])['login_date'] \
     ...:   .apply(lambda x: len(x[x.ge('2015-11-01') & x.le('2016-04-30')]) > 0)
Out[249]:
employer_key  account_id
dell_inc      3333333        True
              5555555       False
google        1111111        True
              4444444        True
walmart       2222222        True
Name: login_date, dtype: bool

Or, additionally, with reset_index():

In [250]: df.groupby(['employer_key','account_id'])['login_date'] \
     ...:   .apply(lambda x: len(x[x.ge('2015-11-01') & x.le('2016-04-30')]) > 0) \
     ...:   .reset_index()
Out[250]:
  employer_key  account_id login_date
0     dell_inc     3333333       True
1     dell_inc     5555555      False
2       google     1111111       True
3       google     4444444       True
4      walmart     2222222       True
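
To see what the boolean-indexing lambda does inside a single group, here is a small sketch on the logins of account 1111111 from the question. It assumes login_date holds parsed timestamps; the variable names mask, has_login and same_answer are only illustrative.

```python
import pandas as pd

# Logins for account 1111111 (google), copied from the question's sample data.
x = pd.Series(
    pd.to_datetime([
        "2016-03-03 20:58:36",
        "2016-04-06 23:29:04",
        "2015-09-03 00:12:25",
        "2015-11-13 01:59:59",
    ]),
    name="login_date",
)

# One boolean per login: is it inside the window?
mask = x.ge("2015-11-01") & x.le("2016-04-30")

# The lambda's test: does at least one login survive the filter?
has_login = len(x[mask]) > 0

# Asking the mask directly gives the same answer without building x[mask].
same_answer = bool(mask.any())
```

The third login (2015-09-03) falls before the window, so only its mask entry is False, and the group as a whole still counts as having logged in.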

Answer 2

Use between to flag the rows, then groupby + max to collapse them to one row per group.

import numpy as np

# Index by the grouping keys so the flag Series carries them along.
s = df.set_index(['employer_key', 'account_id']).login_date
flag = s.between('2015-11-01', '2016-04-30').astype(np.uint8)
flag.groupby(level=[0, 1]).max().reset_index()

  employer_key  account_id  login_date
0     dell_inc     3333333           1
1     dell_inc     5555555           0
2       google     1111111           1
3       google     4444444           1
4      walmart     2222222           1
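
One thing to watch with an upper bound written as '2016-04-30': if login_date holds timestamps, pandas reads that string as midnight at the start of the day, so a login later on 2016-04-30 falls outside the window. If the whole last day should count, a half-open window is one option. The sketch below uses made-up rows (the employer acme and account ids 1 to 3 are hypothetical, not from the question) placed right at the boundary.

```python
import pandas as pd

# Hypothetical rows, chosen to sit on either side of the window's end.
logins = pd.DataFrame({
    "employer_key": ["acme", "acme", "acme"],
    "account_id": [1, 2, 3],
    "login_date": pd.to_datetime([
        "2016-04-30 00:00:00",
        "2016-04-30 09:00:00",
        "2016-05-01 00:00:00",
    ]),
})

start = pd.Timestamp("2015-11-01")
end_exclusive = pd.Timestamp("2016-05-01")  # the day after the last wanted day

# Closed interval as in the answers: the 09:00 login on the last day is missed.
closed = logins["login_date"].between("2015-11-01", "2016-04-30")

# Half-open interval [start, end_exclusive): the whole of 2016-04-30 is included.
half_open = (logins["login_date"] >= start) & (logins["login_date"] < end_exclusive)

# Same groupby + max idea as above, using the half-open flag.
flag = (
    logins.assign(in_window=half_open.astype("uint8"))
    .groupby(["employer_key", "account_id"])["in_window"]
    .max()
    .reset_index()
)
```

On the question's own sample data both windows give the same result, because no login there falls on 2016-04-30.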

How to group by a date range
