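Python multi-category ratio

Time: 2019-03-01 21:10:29

Tags: python pandas numpy

I have a project where I need, for every year, the ratio of each day's volume to its weekly total, for data like the following: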

Week | Year | Weekday | Volume
1    | 2000 | 1       | 0
1    | 2000 | 2       | 10
1    | 2000 | 3       | 10
2    | 2000 | 1       | 10
2    | 2000 | 2       | 0
1    | 2001 | 1       | 0
1    | 2001 | 2       | 10
1    | 2001 | 3       | 10
2    | 2001 | 1       | 10
2    | 2001 | 2       | 0

I would like the output to look like this:

Week | Year | Weekday | Volume | Ratio
1    | 2000 | 1       | 0      | 0
1    | 2000 | 2       | 10     | .5
1    | 2000 | 3       | 10     | .5
2    | 2000 | 1       | 10     | 1
2    | 2000 | 2       | 0      | 0
1    | 2001 | 1       | 0      | 0
1    | 2001 | 2       | 10     | .5
1    | 2001 | 3       | 10     | .5
2    | 2001 | 1       | 10     | 1
2    | 2001 | 2       | 0      | 0

My current solution works roughly like this:

# pseudocode: one concat per (year, week) group
for year in years:
    for week in weeks:
        ratio = week / weeklytotal
        weeklyratios = pd.concat([weeklyratios, ratio], blablabla)

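The problem is that this is very inefficient, especially since I have to run the process more than 30,000 times. Each run takes 2.3 seconds, which adds up to roughly 24 hours of total runtime.

Is there a better way to make this run faster?

2 Answers:

Answer 0: (score: 0)

You can use groupby to compute the total volume for each week. You can then join those totals back onto the original dataframe and compute the ratio in a vectorized way.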

Assuming the original dataframe is df (all dtypes are int):

   Week  Year  Weekday  Volume
0     1  2000        1       0
1     1  2000        2      10
2     1  2000        3      10
3     2  2000        1      10
4     2  2000        2       0
5     1  2001        1       0
6     1  2001        2      10
7     1  2001        3      10
8     2  2001        1      10
9     2  2001        2       0
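
For reproducibility, the frame shown above can be built directly from the sample rows in the question. This is only a convenience sketch; the column names are taken from the printed frame.

```python
import pandas as pd

# Sample data copied from the question; column names match the frame above.
df = pd.DataFrame({
    "Week":    [1, 1, 1, 2, 2, 1, 1, 1, 2, 2],
    "Year":    [2000] * 5 + [2001] * 5,
    "Weekday": [1, 2, 3, 1, 2, 1, 2, 3, 1, 2],
    "Volume":  [0, 10, 10, 10, 0, 0, 10, 10, 10, 0],
})

print(df)
```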

You can use:

s = df.groupby(['Week', 'Year']).sum().drop('Weekday', axis=1)
df2 = df.set_index(['Week', 'Year']).join(s,rsuffix='_tot').sort_index(level=1)
df2['ratio'] = df2.Volume / df2.Volume_tot

print(df2)

which gives:

           Weekday  Volume  Volume_tot  ratio
Week Year                                    
1    2000        1       0          20    0.0
     2000        2      10          20    0.5
     2000        3      10          20    0.5
2    2000        1      10          10    1.0
     2000        2       0          10    0.0
1    2001        1       0          20    0.0
     2001        2      10          20    0.5
     2001        3      10          20    0.5
2    2001        1      10          10    1.0
     2001        2       0          10    0.0

You can then get the expected output with:

print(df2.drop('Volume_tot', axis=1).reset_index())

which gives:

   Week  Year  Weekday  Volume  ratio
0     1  2000        1       0    0.0
1     1  2000        2      10    0.5
2     1  2000        3      10    0.5
3     2  2000        1      10    1.0
4     2  2000        2       0    0.0
5     1  2001        1       0    0.0
6     1  2001        2      10    0.5
7     1  2001        3      10    0.5
8     2  2001        1      10    1.0
9     2  2001        2       0    0.0
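
As a possible alternative to the join (not part of the answer above), `groupby(...).transform("sum")` returns the group total aligned to the original rows, so the ratio can be computed without an intermediate frame or an index reset. A minimal sketch on the sample data from the question:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Week":    [1, 1, 1, 2, 2, 1, 1, 1, 2, 2],
    "Year":    [2000] * 5 + [2001] * 5,
    "Weekday": [1, 2, 3, 1, 2, 1, 2, 3, 1, 2],
    "Volume":  [0, 10, 10, 10, 0, 0, 10, 10, 10, 0],
})

# Total volume of each (Year, Week) group, repeated on every row of that group.
weekly_total = df.groupby(["Year", "Week"])["Volume"].transform("sum")

# Vectorized ratio of each day's volume to its weekly total.
df["Ratio"] = df["Volume"] / weekly_total

print(df)
```

Note that a week whose total volume is 0 yields NaN here (0 / 0), as it would with the join-based version; whether that should become 0 depends on the use case.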

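Answer 1: (score: 0)

You can use the indexing and groupby functionality in pandas to perform grouped operations.

Assuming you have a dataframe df with columns ['week','year','weekday','volume'], your solution would look like this: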

import numpy as np
import pandas as pd
import timeit as t

# make up some data, only 1000 groups not your 30000, but it gets the point across
dates = pd.date_range(start = '2000-01-01', end = '2019-02-28', freq = 'D')
volume = np.random.randint(0,100,len(dates))

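# DatetimeIndex.week was removed in pandas 2.0; isocalendar().week (pandas >= 1.1)
# returns the same ISO week number
df = pd.DataFrame(list(zip(dates.isocalendar().week,dates.year,dates.dayofweek,volume)),
                  columns = ['week','year','weekday','volume'])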

# group
grp = df.groupby(['year','week'])
grp_vol = grp['volume'].sum()

# rename to avoid overlap in names
grp_vol.name  = 'weekly_volume'
# rejoin to calculate your ratio
df = df.join(grp_vol, on = ['year','week'])
df['ratio'] = df['volume']/df['weekly_volume']
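
One caveat about the made-up data above: it pairs an ISO week number with the calendar year, so days around New Year can land in the wrong group. For example, 2000-01-01 falls in ISO week 52 of 1999 and ends up grouped with the last week of December 2000. If that matters, a sketch that groups on the ISO year instead might look like this (assuming pandas >= 1.1 for `isocalendar()`; volumes start at 1 here only so that no week has a zero total):

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2000-01-01", end="2019-02-28", freq="D")
rng = np.random.default_rng(0)
volume = rng.integers(1, 100, len(dates))  # made-up data, never zero

# isocalendar() returns a DataFrame with columns year, week and day.
iso = dates.isocalendar()

df = pd.DataFrame({
    "year": iso["year"].astype("int64").to_numpy(),  # ISO year, not calendar year
    "week": iso["week"].astype("int64").to_numpy(),
    "weekday": dates.dayofweek.to_numpy(),
    "volume": volume,
})

# Weekly totals aligned to the rows, then the vectorized ratio.
df["weekly_volume"] = df.groupby(["year", "week"])["volume"].transform("sum")
df["ratio"] = df["volume"] / df["weekly_volume"]

# With consistent keys, no group can contain more than 7 days.
days_per_group = df.groupby(["year", "week"]).size()
print(days_per_group.max())
```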

Then time it (in IPython or Jupyter):

%timeit df['ratio'] = df['volume']/df['weekly_volume']
196 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So, less than 24 hours.
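
`%timeit` is an IPython magic and is not available in a plain Python script. Outside a notebook, a comparable measurement can be sketched with the standard-library `timeit` module. The frame below is a synthetic stand-in of roughly the same size as the example (about 7,000 daily rows); its `weekly_volume` values are random placeholders, not real weekly totals, and the timing will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 7000  # roughly the number of daily rows in the example above

# Synthetic stand-in data: weekly_volume is a placeholder, not a real group total.
df = pd.DataFrame({
    "volume": rng.integers(0, 100, n),
    "weekly_volume": rng.integers(100, 700, n),
})

def compute_ratio():
    # The vectorized step being timed.
    df["ratio"] = df["volume"] / df["weekly_volume"]

runs = 1000
seconds = timeit.timeit(compute_ratio, number=runs)
per_call = seconds / runs
print(f"{per_call * 1e6:.1f} µs per call")
```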