Pandas: filter a CSV by date

Time: 2017-04-11 11:29:50

Tags: python python-3.x csv pandas

How can I filter a CSV by date with pandas?

Example CSV

User    Dates       Hours   shift
User1   01.01.2012      5   aaa 
User1   02.01.2012      5   aaa
User1   03.01.2012      2   bbb
User1   04.01.2012      3   aaa
.....
User1   12.03.2012      1   aaa
User1   13.03.2012      8   ccc
.....
User2   04.02.2012      4   aaa
User2   05.02.2012      3   bbb

End of example

I can filter by user:

use = users.loc["User1"]

I can also sum all the hours:

print(use["Hours"].sum())

I can count his shifts:

counts = use.loc[use['shift'] == 'aaa', 'Hours'].value_counts()

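But I don't know how to filter by date in combination with the statements above. For example, count all shifts of User2 in March, or sum all hours worked by User1 in February.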

I have more or less managed to filter the table by date and user:

use['Date'] = pd.to_datetime(use['Date'], infer_datetime_format=True, exact=True)
mask = (use['Date'] > Start) & (use['Date'] <= End)
print(use.loc[mask])

But I can't figure out how to combine them. Desired output:

Overview March 2016
User1 made 3 aaa shifts
User1 worked 12h in March 2016

Update: I have made some progress.

print(use[use['Date'] > '02.01.2012']['hours'].sum())

This works, but it is not what I want. With:

print(use[use['Date'] > '02.01.2012'] & (use[use['Date'] < '02.05.2012'],['hours'].sum()))

I get:

AttributeError: 'list' object has no attribute 'sum'

2 Answers:

Answer 0: (score: 1)

I think you can use:

Start = '2012-01-01'
End = '2012-03-03'
use['Dates'] = pd.to_datetime(use['Dates'], dayfirst=True)
mask = (use['Dates'] > Start) & (use['Dates'] <= End) & (use['shift'] == 'aaa')
use1 = use.loc[mask]
print (use1)
    User      Dates  Hours shift
1  User1 2012-01-02      5   aaa
3  User1 2012-01-04      3   aaa
6  User2 2012-02-04      4   aaa

use1 = use.query('Dates > @Start and Dates <= @End and shift == "aaa"')
print (use1)
    User      Dates  Hours shift
1  User1 2012-01-02      5   aaa
3  User1 2012-01-04      3   aaa
6  User2 2012-02-04      4   aaa

print (mask.sum())
3
counts = use.loc[mask, 'Hours'].value_counts()
print (counts)
3    1
5    1
4    1
Name: Hours, dtype: int64

Edit:

Start = '2012-01-01'
End = '2012-03-03'
use['Dates'] = pd.to_datetime(use['Dates'], dayfirst=True)
mask = (use['Dates'] > Start) & (use['Dates'] <= End)
use1 = use.loc[mask]
print (use1)
    User      Dates  Hours shift
1  User1 2012-01-02      5   aaa
2  User1 2012-01-03      2   bbb
3  User1 2012-01-04      3   aaa
6  User2 2012-02-04      4   aaa
7  User2 2012-02-05      3   bbb


counts = (use1.groupby(['User','shift'])['Hours']
              .agg(SUM='sum', COUNT='size')
              .reset_index())
print (counts)
    User shift  SUM  COUNT
0  User1   aaa    8      2
1  User1   bbb    2      1
2  User2   aaa    4      1
3  User2   bbb    3      1
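
To get closer to the sentence-style output asked for in the question, the grouped table can be turned into text lines. This is only a sketch: the sample frame is rebuilt from the question's CSV, named aggregation (`SUM=...`, `COUNT=...`) assumes pandas 0.25 or newer, and the sentence wording is my own assumption about the desired format.

```python
import pandas as pd

# Sample data rebuilt from the question's example CSV
use = pd.DataFrame({
    'User': ['User1'] * 6 + ['User2'] * 2,
    'Dates': ['01.01.2012', '02.01.2012', '03.01.2012', '04.01.2012',
              '12.03.2012', '13.03.2012', '04.02.2012', '05.02.2012'],
    'Hours': [5, 5, 2, 3, 1, 8, 4, 3],
    'shift': ['aaa', 'aaa', 'bbb', 'aaa', 'aaa', 'ccc', 'aaa', 'bbb'],
})
use['Dates'] = pd.to_datetime(use['Dates'], format='%d.%m.%Y')

Start = '2012-01-01'
End = '2012-03-03'
mask = (use['Dates'] > Start) & (use['Dates'] <= End)

# One row per (User, shift) with total hours and number of shifts
counts = (use.loc[mask]
             .groupby(['User', 'shift'])['Hours']
             .agg(SUM='sum', COUNT='size')
             .reset_index())

# Build one sentence per row; the wording is an assumed format
lines = [f"{row.User} made {row.COUNT} {row.shift} shifts ({row.SUM}h)"
         for row in counts.itertuples(index=False)]
print("\n".join(lines))
```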

EDIT1:

If you need multiple conditions, use loc:

print(use.loc[(use['Date'] > '02.01.2012') & (use['Date'] < '02.05.2012'),'hours'].sum())
0
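
A likely reason for the 0, assuming the Date column still holds text at this point: strings are compared character by character, so '04.02.2012' is not smaller than '02.05.2012'. The sketch below uses a small hypothetical Series to contrast text comparison with comparison after parsing; the two bounds are written as unambiguous ISO dates, which is my own choice.

```python
import pandas as pd

# Hypothetical three-row column in the question's dd.mm.yyyy format
dates = pd.Series(['01.01.2012', '04.02.2012', '12.03.2012'])

# Text comparison is lexicographic, so the day digits decide the result
as_text = (dates > '02.01.2012') & (dates < '02.05.2012')

# After parsing, the comparison is done on real dates
parsed = pd.to_datetime(dates, format='%d.%m.%Y')
start = pd.Timestamp('2012-01-02')   # 2 January 2012
end = pd.Timestamp('2012-05-02')     # 2 May 2012
as_dates = (parsed > start) & (parsed < end)

print(as_text.sum())    # rows matched when comparing text
print(as_dates.sum())   # rows matched when comparing dates
```

Parsing the column first with to_datetime, as in the complete example, avoids this.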

All together:

use = pd.DataFrame({'Date': ['01.01.2012', '02.01.2012', '03.01.2012', '04.01.2012', '12.03.2012', '13.03.2012', '04.02.2012', '05.02.2012'], 'User': ['User1', 'User1', 'User1', 'User1', 'User1', 'User1', 'User2', 'User2'], 'hours': [5, 5, 2, 3, 1, 8, 4, 3], 'shift': ['aaa', 'aaa', 'bbb', 'aaa', 'aaa', 'ccc', 'aaa', 'bbb']})
print (use)

    User        Date  hours shift
0  User1  01.01.2012      5   aaa
1  User1  02.01.2012      5   aaa
2  User1  03.01.2012      2   bbb
3  User1  04.01.2012      3   aaa
4  User1  12.03.2012      1   aaa
5  User1  13.03.2012      8   ccc
6  User2  04.02.2012      4   aaa
7  User2  05.02.2012      3   bbb
Start = '2012-01-01'
End = '2012-01-30'
User = 'User1'
shift = 'aaa'

use['Date'] = pd.to_datetime(use['Date'], dayfirst=True)

#how many Hours by dates (sum)
print(use.loc[(use['Date'] > Start) & (use['Date'] < End),'hours'].sum())
10

#how many Hours by dates and user (sum)
print(use.loc[(use['Date'] > Start) & (use['Date'] < End) & 
              (use['User'] == User),'hours'].sum())
10

#how many rows by dates and user (count)
print(((use['Date'] > Start) & (use['Date'] < End) & 
       (use['User'] == User)).sum())
3

#how many rows by dates and user and shift (count)
print(((use['Date'] > Start) & (use['Date'] < End) & 
       (use['User'] == User ) & (use['shift'] == shift)).sum())
2
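
Since the question asks about whole months ("all shifts in March", "all hours in February"), another option is to group by month instead of setting Start and End by hand. A minimal sketch, assuming the same sample frame as above; the helper column `month` and the result names `total_hours`, `shifts` and `aaa_shifts` are made up for illustration.

```python
import pandas as pd

use = pd.DataFrame({
    'Date': ['01.01.2012', '02.01.2012', '03.01.2012', '04.01.2012',
             '12.03.2012', '13.03.2012', '04.02.2012', '05.02.2012'],
    'User': ['User1'] * 6 + ['User2'] * 2,
    'hours': [5, 5, 2, 3, 1, 8, 4, 3],
    'shift': ['aaa', 'aaa', 'bbb', 'aaa', 'aaa', 'ccc', 'aaa', 'bbb'],
})
use['Date'] = pd.to_datetime(use['Date'], format='%d.%m.%Y')

# Helper column such as '2012-03' (name chosen for this sketch)
use['month'] = use['Date'].dt.strftime('%Y-%m')

# Total hours and number of shifts per user and month
per_month = (use.groupby(['User', 'month'])['hours']
                .agg(total_hours='sum', shifts='size')
                .reset_index())
print(per_month)

# Number of 'aaa' shifts per user and month
aaa_per_month = (use[use['shift'] == 'aaa']
                 .groupby(['User', 'month'])
                 .size()
                 .reset_index(name='aaa_shifts'))
print(aaa_per_month)
```

A single month can then be picked with an ordinary filter such as per_month[per_month['month'] == '2012-03'].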

Answer 1: (score: 0)

You need to narrow down the dataset before running the aggregation.

use[use['Dates'] == '01.01.2012']['hours'].sum()

The first part of that line is the filter:

use[use['Dates'] == '01.01.2012']
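
The rest of the line, ['hours'].sum(), then selects one column of the filtered rows and adds it up. The two steps can be written out separately, as in the sketch below. The three-row frame is hypothetical, and the comparison with '01.01.2012' assumes the Dates column still holds text in that exact format.

```python
import pandas as pd

# Hypothetical data; Dates is kept as text on purpose
use = pd.DataFrame({
    'Dates': ['01.01.2012', '01.01.2012', '02.01.2012'],
    'hours': [5, 2, 3],
})

# Step 1: keep only the rows for the wanted date
filtered = use[use['Dates'] == '01.01.2012']

# Step 2: select the column and aggregate
total = filtered['hours'].sum()

# The same result in one .loc call (row condition, column label)
total_loc = use.loc[use['Dates'] == '01.01.2012', 'hours'].sum()
print(total, total_loc)
```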