Calculate the percentage of days covered by ranges using groupby and Pandas functions

Time: 2016-11-04 18:23:12

Tags: python python-2.7 pandas dataframe

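I have a simple Pandas dataframe in which each row represents a person and a date range. For each person, I want to know what percentage of the days in a hard-coded range (defined by the variables period_start and period_end) is covered by that person's entries in the dataframe.

I assume there is a simple way to do this with Pandas, but I could not find one. I have a solution that uses multiple dataframes and several nested loops, but it is very inefficient at scale. How can I make better use of Pandas to do this? I think groupby makes sense here, but I am not sure how to apply it when the range spans two columns and the ranges may overlap.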

import pandas as pd
from datetime import datetime
df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])

# start and end date
period_start = datetime(year=2016, month=1, day=1)
period_end = datetime(year=2016, month=12, day=31)

# calculate total days (inclusive of both endpoints)
total_days = (period_end - period_start).days + 1

# convert columns to dates
df['startdate']= pd.to_datetime(df['startdate'],  format='%Y-%m-%d')
df['enddate']= pd.to_datetime(df['enddate'],  format='%Y-%m-%d')

# create a TimeIndex dataframe with columns for each person
rng = pd.date_range(period_start, periods=total_days, freq='D')
people = list(set(df['person'].tolist()))
dates_df = pd.DataFrame(columns=people, index=rng).fillna(False)

# loop over each date (index)
for index, row in dates_df.iterrows():

   # loop over each column (person) 
   for person in people:
       tmp = df[df['person'] == person]

       # loop over each entry for the person
       for index1, row1 in tmp.iterrows():

           # check whether the date (the dates_df index) falls within the entry's range
           value = row1['startdate'] <= index <= row1['enddate']

           # if it's not already set to True, set it to True
           # (.loc replaces the deprecated .ix indexer)
           if not dates_df.loc[index, person] and value:
               dates_df.loc[index, person] = True

# for each person, show the percentage of days in range that are covered
for person in people:
    print('%s %s' % (person, sum(dates_df[person].tolist()) / float(total_days)))

Expected output:

A 0.196721311475
B 0.166666666667

1 answer:

Answer 0: (score: 1)

This should do it. I am guessing at the + 1 because you seem to want the total to include both ends of each range, but edit as needed :)

import pandas as pd
from datetime import datetime

df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])

# start and end date
period_start = datetime(year=2016, month=1, day=1)
period_end = datetime(year=2016, month=12, day=31)

# convert columns to dates
df['startdate']= pd.to_datetime(df['startdate'],  format='%Y-%m-%d')
df['enddate']= pd.to_datetime(df['enddate'],  format='%Y-%m-%d')
df['days'] = df.apply(lambda x: max((min(x.enddate, period_end) - max(x.startdate, period_start)).days + 1, 0), axis=1)

# percentage of days in range by person
# float() keeps this a true division under Python 2 as well
people_pct = df.groupby('person').apply(lambda x: x.days.sum() / float((period_end - period_start).days + 1))
print(people_pct.head())

-----------------
    person
    A    0.196721
    B    0.166667
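
To make the + 1 and the max(..., 0) in the days calculation above easier to check, here is a minimal sketch of the same clipping arithmetic on plain datetime objects. The helper name covered_days and the two out-of-period sample ranges are made up for illustration and are not part of the question's data.

```python
from datetime import datetime

def covered_days(start, end, period_start, period_end):
    """Days of [start, end] that fall inside [period_start, period_end], endpoints inclusive."""
    clipped_start = max(start, period_start)
    clipped_end = min(end, period_end)
    # + 1 counts both endpoints; max(..., 0) handles ranges entirely outside the period
    return max((clipped_end - clipped_start).days + 1, 0)

period_start = datetime(2016, 1, 1)
period_end = datetime(2016, 12, 31)

# Range fully inside the period: all 31 days of January
print(covered_days(datetime(2016, 1, 1), datetime(2016, 1, 31), period_start, period_end))
# Range starting before the period: only Jan 1 to Jan 5 count
print(covered_days(datetime(2015, 12, 25), datetime(2016, 1, 5), period_start, period_end))
# Range entirely after the period: nothing counts
print(covered_days(datetime(2017, 2, 1), datetime(2017, 2, 10), period_start, period_end))
```

The three calls print 31, 5 and 0. Without the + 1, the January range would count as 30 days, which is why the same + 1 also appears in the denominator.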

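You were on the right track: pandas groupby is great for splitting data, but the real power comes from the .apply() function, which can perform common mathematical transformations (mean, std, etc.) or, as in this case, a custom function.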

The lambda passed to apply says "for each row/column within the group (depending on axis), run this custom function and return a Series".
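
As a possible alternative to the two apply calls, the same numbers can probably be obtained with column-wise operations, which tend to scale better than a Python lambda per row. This is only a sketch and not the answerer's method: it assumes Series.clip accepts a Timestamp bound for datetime columns (true for recent pandas versions as far as I know), and the names clipped_start, clipped_end and total_days are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['2016-01-01', '2016-01-31', 'A'],
          ['2016-02-02', '2016-02-10', 'A'],
          ['2016-03-01', '2016-04-01', 'A'],
          ['2016-01-01', '2016-03-01', 'B']],
    columns=['startdate', 'enddate', 'person'])
df['startdate'] = pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate'] = pd.to_datetime(df['enddate'], format='%Y-%m-%d')

period_start = pd.Timestamp('2016-01-01')
period_end = pd.Timestamp('2016-12-31')
total_days = (period_end - period_start).days + 1

# Clip every range to the period in one column-wise operation
clipped_start = df['startdate'].clip(lower=period_start)
clipped_end = df['enddate'].clip(upper=period_end)

# Inclusive day count per row; ranges entirely outside the period become 0
df['days'] = ((clipped_end - clipped_start).dt.days + 1).clip(lower=0)

# Built-in sum per person, then divide by the length of the period
people_pct = df.groupby('person')['days'].sum() / float(total_days)
print(people_pct)
```

For the sample data this gives 72 / 366 for A and 61 / 366 for B, matching the output above. Like the apply version, it still assumes a person's ranges do not overlap.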

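While this covers your question, it does not detect unique dates, so it assumes that each person's rows do not overlap, as in the data you showed.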
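
If overlapping rows do occur, one way to avoid double counting is to collect the distinct calendar days per person before dividing. The sketch below is an assumption about how that could look, not part of the original answer: the overlapping sample rows and the helper unique_days are made up for illustration, and expanding every range into daily timestamps trades memory for simplicity.

```python
import pandas as pd

# Hypothetical data: A's second range overlaps the first
df = pd.DataFrame(
    data=[['2016-01-01', '2016-01-31', 'A'],
          ['2016-01-20', '2016-02-10', 'A'],
          ['2016-01-01', '2016-03-01', 'B']],
    columns=['startdate', 'enddate', 'person'])
df['startdate'] = pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate'] = pd.to_datetime(df['enddate'], format='%Y-%m-%d')

period_start = pd.Timestamp('2016-01-01')
period_end = pd.Timestamp('2016-12-31')
total_days = (period_end - period_start).days + 1

def unique_days(group):
    """Count distinct days inside the period covered by any row of the group."""
    days = set()
    for start, end in zip(group['startdate'], group['enddate']):
        # Clip the range to the period; skip it if nothing is left
        start, end = max(start, period_start), min(end, period_end)
        if start <= end:
            days.update(pd.date_range(start, end, freq='D'))
    return len(days)

# One entry per person: number of distinct covered days
covered = pd.Series({person: unique_days(group)
                     for person, group in df.groupby('person')})
people_pct = covered / float(total_days)
print(people_pct)
```

Here A is counted as 41 distinct days (Jan 1 to Feb 10) rather than the 53 that summing both rows would give, and B remains at 61.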