Python, in df

Time: 2019-04-12 21:59:01

Tags: python pandas optimization vectorization

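Coming from a C# background (some years ago) and being very new to Python, I am struggling to optimise my code. Quite literally, for loops are very slow.

In the code below, the loop that adds a calculated column to each DataFrame in a dict appears to be a massive bottleneck.

I have read about ways around this, such as vectorisation and Numba, but I don't think I know enough Python to really understand and use them.

In fact, my attempts at both failed, probably because I implemented them wrongly. The one exception was a test I did with np.where, which showed just how bad my for loop/calculation is.

I'll leave those attempts out of my working example, but I can add them later if needed: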

import pandas as pd
import numpy as np
import datetime as date
import itertools

def points(row):
val = 0
if row['Ob2'] > 0.5:
    foo = row['Ob3'] - row['Ob1']
    if foo < 0.1:
        val = 1 - foo
else:
    val = 0
return val

print("Start: "+ str(date.datetime.now()))
print()

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*1000,\
                   'Ob1' : np.random.rand(70000),\
                   'Ob2' : np.random.rand(70000) ,\
                   'Ob3' : np.random.rand(70000)})

#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))

#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Ob1'])

print("DF fill: "+ str(date.datetime.now()))
print()

#Add test calculated column
for tbl in DataFrameDict:
    DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1) #Slow loop
    #example vectorised, huge difference in run time
    #DataFrameDict[tbl]['Test'] = np.where((DataFrameDict[tbl]['Ob2']>0.5),1,0)

print("Calc'd: "+ str(date.datetime.now()))
print()

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

print("Fin: "+ str(date.datetime.now()))
print()

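Edit: the function adds a column that is a comparison between the two "players" in each df, so we cannot apply it to the source df. Sorry, that wasn't clear.

I obviously need to go back and learn some Python basics, but my boss is waiting on the real script, which takes 3 hours to run for a standard 500 "Names" (~125K DataFrames).

If anyone can help me optimise this, it would be much appreciated!

EDIT2: a better representation of the real-world problem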

import pandas as pd
import numpy as np
import datetime as date
import itertools

def random_dates(start, end, n, unit='D', seed=None):
    if not seed:
        np.random.seed(0)

    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start

def points(row):
    val = 0
    if row['Names'] != row['Names2']:
        secs = row['Dates'] - row['Dates2']
        secs = secs.total_seconds()
        if secs in range(1, 301):
            val = 301 - secs
    else:
        val = 0
    return val

print("Start: "+ str(date.datetime.now()))
print()

player_list = ['player' + str(x) for x in range(1,71)]

np.random.seed(0)
start = pd.to_datetime('2019-04-01')
end = pd.to_datetime('2019-04-10')

data = pd.DataFrame({'Names': player_list*1000,
                     'Dates': random_dates(start, end, 70000)})

#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))

#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Dates'])
    DataFrameDict[key]['Names2'] = DataFrameDict[key]['Names'].shift(1)
    DataFrameDict[key]['Dates2'] = DataFrameDict[key]['Dates'].shift(1)

print("DF fill: "+ str(date.datetime.now()))
print()

#Add test calculated column
for tbl in DataFrameDict:
    DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1) #Slow loop
    #example vectorised, huge difference in run time
    #DataFrameDict[tbl]['Test'] = np.where((DataFrameDict[tbl]['Ob2']>0.5),1,0)

print("Calc'd: "+ str(date.datetime.now()))
print()

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

print("Fin: "+ str(date.datetime.now()))
print()

My Solution, which I didn't want to post here because it is messy.

5 answers:

Answer 0: (score: 2)

import pandas as pd
import numpy as np
import datetime as date
import itertools

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*1000,\
                   'Ob1' : np.random.rand(70000),\
                   'Ob2' : np.random.rand(70000) ,\
                   'Ob3' : np.random.rand(70000)})

data['Test'] = np.where(data['Ob2'] > 0.5, np.where(data['Ob3'] - data['Ob1'] < 0.1, 1 - (data['Ob3'] - data['Ob1']), 0), 0)


comboNames = list(itertools.combinations(data.Names.unique(), 2))
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Ob1'])

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

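I tried to keep as much of your code as possible. I changed your function to use np.where instead of apply, and I added the Test column before creating the dict because, as I said in the comments, applying the function at that point makes no sense.

With %%timeit: 26.2 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)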
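As a sanity check on swapping apply for np.where, here is a minimal sketch that compares the two on a small random frame. The seed, the frame size and the names small, slow and fast are arbitrary choices for illustration, and points is written with its body indented.

```python
import numpy as np
import pandas as pd

def points(row):
    # Row-wise scoring rule from the question.
    val = 0
    if row['Ob2'] > 0.5:
        foo = row['Ob3'] - row['Ob1']
        if foo < 0.1:
            val = 1 - foo
    return val

rng = np.random.default_rng(0)  # arbitrary seed, only for this sketch
small = pd.DataFrame({'Ob1': rng.random(1000),
                      'Ob2': rng.random(1000),
                      'Ob3': rng.random(1000)})

# Slow path: one Python function call per row.
slow = small.apply(points, axis=1)

# Fast path: the same two conditions evaluated on whole columns at once.
diff = small['Ob3'] - small['Ob1']
fast = np.where((small['Ob2'] > 0.5) & (diff < 0.1), 1 - diff, 0)

print(np.allclose(slow.to_numpy(dtype=float), fast))
```

Both paths compute 1 - (Ob3 - Ob1) only where the two conditions hold, so the results should agree; the difference is that np.where does the work in compiled code rather than in a Python-level loop.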

Edit:

This is the fastest I could get it:

%%timeit

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*1000,\
                    'Ob1' : np.random.rand(70000),\
                    'Ob2' : np.random.rand(70000) ,\
                    'Ob3' : np.random.rand(70000)})

# Calculating the individual total test score for each row in data
data['test'] = np.where(data['Ob2'] > 0.5, np.where(data['Ob3'] - data['Ob1'] < 0.1, 1 - (data['Ob3'] - data['Ob1']), 0), 0)

# The goal of this function is to get the sum, and count of the test score for each player
def ScoreAndCount(row):
    score = row.sum()
    count = row.astype(bool).sum()
    return score, count

# Applying the function above, I group by each player and 
# get the total sum of test and the total count for each player.
df = data.groupby('Names')['test'].apply(ScoreAndCount).reset_index()
df = pd.concat([df['Names'], df.test.apply(pd.Series).rename(columns = {0: 'Score', 1:'Count'})], axis = 1)

# Using itertools I create a dataframe Summary that has two columns covering
# every single matchup between player, and label the columns Player1 and Player2
summary = pd.DataFrame(itertools.combinations(data.Names.unique(), 2), columns = ['Player1', 'Player2'])

# Below ,I merge summary with my dataframe that contains the total score and count 
# for each player. Every single time there is a player1 in the Player1 column it
# will merge the their total score and count, the same is then done for the 
# players in the Player2 column. After these merges I have 6 columns. The two 
# player columns, and the total scores and counts for both individuals.
summary = summary.merge(df, left_on = 'Player1', right_on = 'Names')\
                 .merge(df, left_on = 'Player2', right_on = 'Names')\
                 .drop(columns = ['Names_x', 'Names_y'])

# Below, I add the players 'Score' and 'Count' columns to get the total score
# and total count per iteration. Then I clean the df dropping the columns that 
# are not needed, and sorting by score.
summary['Score'] = summary['Score_x'] + summary['Score_y']
summary['Count'] = summary['Count_x'] + summary['Count_y']
summary.drop(columns = ['Score_x','Count_x', 'Score_y','Count_y'], inplace = True)
summary.sort_values('Score', ascending = False)

157 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

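My goal was to use no loops or dicts, to increase the speed further.

My function ScoreAndCount returns the score and count for each player. pd.concat takes what that function returns and attaches it to the player names in df.

Then I used itertools combinations and turned the result into its own dataframe, called summary. I then merged the Player1 and Player2 columns of summary with the Names column of df.

Next, I added up the players' scores and counts, dropped the unnecessary columns and sorted. I ended up at 157 ms per loop. The slowest steps are the merges and the concat, but I couldn't think of a way around them to speed things up further.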
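The aggregate-then-merge idea can be sketched on a tiny hand-made frame. This is only an illustration under one assumption: each row's score depends on that row alone, not on which other player it is paired with. The data values, the nonzero helper column and the _1/_2 suffixes are made up for the example.

```python
import itertools
import pandas as pd

# Tiny hypothetical data set: three players, two scored rows each.
data = pd.DataFrame({'Names': ['a', 'b', 'c', 'a', 'b', 'c'],
                     'test':  [1.5, 0.0, 2.0, 0.0, 1.0, 0.0]})
data['nonzero'] = data['test'] != 0

# One row per player: total score and number of non-zero scores.
per_player = (data.groupby('Names')
                  .agg(Score=('test', 'sum'), Count=('nonzero', 'sum'))
                  .reset_index())

# Every unordered pair of players.
pairs = pd.DataFrame(list(itertools.combinations(per_player['Names'], 2)),
                     columns=['Player1', 'Player2'])

# Attach each player's totals, then add them up per pair.
summary = (pairs
           .merge(per_player, left_on='Player1', right_on='Names')
           .merge(per_player, left_on='Player2', right_on='Names',
                  suffixes=('_1', '_2')))
summary['Score'] = summary['Score_1'] + summary['Score_2']
summary['Count'] = summary['Count_1'] + summary['Count_2']
summary = summary[['Player1', 'Player2', 'Score', 'Count']]

print(summary)
```

Because a pair's total is just the sum of the two players' own totals, the 2,415 per-pair DataFrames never need to be built. If the score really depends on how the two players' rows interleave, this shortcut no longer applies.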

EDIT3

We'll set a seed and use the same data df for both tests:

np.random.seed(0)

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*10,\
                    'Ob1' : np.random.rand(700),\
                    'Ob2' : np.random.rand(700) ,\
                    'Ob3' : np.random.rand(700)})

data.head()

     Names       Ob1       Ob2       Ob3
0  player1  0.548814  0.373216  0.313591
1  player2  0.715189  0.222864  0.365539
2  player3  0.602763  0.080532  0.201267
3  player4  0.544883  0.085311  0.487148
4  player5  0.423655  0.221396  0.990369

Next, we'll use your exact code and look at the dictionary entry for player1 and player2.


def points(row):
    val = 0
    if row['Ob2'] > 0.5:
        foo = row['Ob3'] - row['Ob1']
        if foo < 0.1:
            val = 1 - foo
    else:
        val = 0
    return val

#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Ob1'])

#Add test calculated column
for tbl in DataFrameDict:
    DataFrameDict[tbl]['Test'] = DataFrameDict[tbl].apply(points, axis=1)

DataFrameDict[('player1', 'player2')].head()

       Names       Ob1       Ob2       Ob3  Test
351  player2  0.035362  0.013509  0.384273   0.0
630  player1  0.062636  0.305047  0.571550   0.0
561  player2  0.133461  0.758194  0.964210   0.0
211  player2  0.216897  0.056877  0.417333   0.0
631  player2  0.241902  0.557987  0.983555   0.0

Next, we'll do what you do in your summary and take the sum of the Test column, which is the score generated for player1 and player2:

DataFrameDict[('player1', 'player2')]['Test'].sum()

8.077455441105938

So we end up with 8.0774. Now, if what I'm saying is true, running the code from my edit above (Edit2) should give a score of 8.077 between Player1 and Player2.

data['test'] = np.where(data['Ob2'] > 0.5, np.where(data['Ob3'] - data['Ob1'] < 0.1, 1 - (data['Ob3'] - data['Ob1']), 0), 0)

def ScoreAndCount(row):
    score = row.sum()
    count = row.astype(bool).sum()
    return score, count

df = data.groupby('Names')['test'].apply(ScoreAndCount).reset_index()
df = pd.concat([df['Names'], df.test.apply(pd.Series).rename(columns = {0: 'Score', 1:'Count'})], axis = 1)

summary = pd.DataFrame(itertools.combinations(data.Names.unique(), 2), columns = ['Player1', 'Player2'])
summary = summary.merge(df, left_on = 'Player1', right_on = 'Names')\
                 .merge(df, left_on = 'Player2', right_on = 'Names')\
                 .drop(columns = ['Names_x', 'Names_y'])

summary['Score'] = summary['Score_x'] + summary['Score_y']
summary['Count'] = summary['Count_x'] + summary['Count_y']
summary.drop(columns = ['Score_x','Count_x', 'Score_y','Count_y'], inplace = True)
summary = summary.sort_values('Score', ascending = False)

Now we'll check the row that holds player1 and player2:

summary[(summary['Player1'] == 'player1')&(summary['Player2'] == 'player2')]

   Player1  Player2     Score  Count
0  player1  player2  8.077455    6.0

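As you can see, my Edit2 code calculates exactly the same player1/player2 score as your code does.

Answer 1: (score: 1)

I was able to vectorise your function with numba, and the resulting code runs in about 8 seconds under %%timeit. I followed Ben Pap's suggestion and pre-computed the Test column. I also sorted the values up front and tidied up the creation of DataFrameDict.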


%%timeit
import pandas as pd
import numpy as np
import datetime as date
import itertools
import numba

@numba.vectorize
def points(a,b,c):
    val = 0
    if b > 0.5:
        foo = c - a
        if foo < 0.1:
            val = 1 - foo
    else:
        val = 0
    return val

player_list = ['player' + str(x) for x in range(1,71)]

data = pd.DataFrame({'Names': player_list*1000,\
                   'Ob1' : np.random.rand(70000),\
                   'Ob2' : np.random.rand(70000) ,\
                   'Ob3' : np.random.rand(70000)})

data['Test'] = points(data['Ob1'].values,data['Ob2'].values,data['Ob3'].values)
data = data.sort_values(['Ob1'])

comboNames = list(itertools.combinations(data.Names.unique(), 2))
DataFrameDict = {elem : data.loc[data.Names.isin(elem)] for elem in comboNames}

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

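8.52 s ± 204 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 2: (score: 1)

I focused on your function points and the for loop that calls apply.

The function points can be turned into this condition (a_df being each DataFrame in DataFrameDict):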

(a_df['Ob2'] > 0.5) & (a_df['Ob3'] - a_df['Ob1'] < 0.01)

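Where this condition holds, the value 1 - x['Ob3'] + x['Ob1'] is assigned to the Test column; every other row gets 0 in Test. So, first assign a new Test column, filled with 0, to each a_df. Then filter down to only the rows that meet the condition above, which shrinks the data set, and set the new value on that subset. Finally, update a_df['Test'] from the subset's Test column and assign a_df back into the DataFrameDict dictionary. Your for loop therefore becomes: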

for tbl in DataFrameDict:        
    a_df = DataFrameDict[tbl].assign(Test=0)
    a_df['Test'].update(a_df[(a_df['Ob2'] > 0.5) & (a_df['Ob3'] - a_df['Ob1'] < 0.01)].assign(Test=lambda x: 1 - x['Ob3'] + x['Ob1'])['Test'])
    DataFrameDict[tbl] = a_df

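It runs fast.

Output: each DataFrame in DataFrameDict has its Test column filled according to the condition above. I picked one of the final DataFrames from DataFrameDict to show the output.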

In [1288]: DataFrameDict[('player65', 'player67')]
Out[1288]:
          Names       Ob1       Ob2       Ob3      Test
61456  player67  0.000271  0.686051  0.729086  0.000000
25824  player65  0.001281  0.505552  0.296550  0.000000
25544  player65  0.001398  0.770805  0.471477  0.000000
65864  player65  0.001999  0.147407  0.291841  0.000000
33104  player65  0.002661  0.254329  0.126290  0.000000
42554  player65  0.003172  0.529603  0.181796  0.000000
28064  player65  0.003663  0.227429  0.558233  0.000000
24844  player65  0.005517  0.096817  0.710771  0.000000
2584   player65  0.005974  0.338904  0.582034  0.000000
42694  player65  0.005996  0.171637  0.765277  0.000000
6154   player65  0.006126  0.181239  0.295149  0.000000
65234  player65  0.008386  0.180613  0.994273  0.000000
5034   player65  0.008921  0.013060  0.305063  0.000000
21766  player67  0.010950  0.590966  0.481547  0.000000
53054  player65  0.010957  0.731794  0.262754  0.000000
15956  player67  0.010996  0.046718  0.153172  0.000000
36046  player67  0.011634  0.250039  0.064184  0.000000
50394  player65  0.011835  0.995986  0.834281  0.000000
64326  player67  0.011974  0.499262  0.745194  0.000000
30236  player67  0.013029  0.101714  0.143509  0.000000
23374  player65  0.014865  0.158185  0.575582  0.000000
1256   player67  0.014915  0.938301  0.629850  0.000000
10216  player67  0.015122  0.450750  0.137085  0.000000
21904  player65  0.016372  0.147897  0.786882  0.000000
34854  player65  0.016603  0.513692  0.676243  0.000000
33806  player67  0.016820  0.063896  0.577731  0.000000
29816  player67  0.017565  0.060496  0.151780  0.000000
6924   player65  0.017652  0.121581  0.117512  0.000000
39126  player67  0.017990  0.516819  0.663672  0.000000
39896  player67  0.018085  0.031526  0.075832  0.000000
...         ...       ...       ...       ...       ...
61526  player67  0.985386  0.512073  0.754241  1.231145
48926  player67  0.985504  0.007080  0.671456  0.000000
16234  player65  0.985775  0.846647  0.998181  0.000000
12736  player67  0.985846  0.283997  0.667314  0.000000
47874  player65  0.986084  0.052026  0.508918  0.000000
29886  player67  0.986655  0.998440  0.068136  1.918518
49416  player67  0.986706  0.833053  0.182814  1.803892
42486  player67  0.986797  0.608128  0.136219  1.850578
55644  player65  0.987796  0.215898  0.561002  0.000000
1814   player65  0.987935  0.324954  0.525433  0.000000
7554   player65  0.988910  0.664914  0.674546  1.314365
59774  player65  0.989147  0.235214  0.913588  0.000000
58444  player65  0.989467  0.645191  0.533468  1.455999
62856  player67  0.989470  0.523544  0.302838  1.686632
48646  player67  0.990588  0.522521  0.201132  1.789456
11336  player67  0.990629  0.932360  0.756544  1.234085
31774  player65  0.990881  0.981641  0.943824  1.047057
18964  player65  0.992287  0.808989  0.948321  1.043967
14486  player67  0.992909  0.437701  0.484678  0.000000
12246  player67  0.994027  0.542903  0.234830  1.759197
33596  player67  0.994257  0.949055  0.098368  1.895889
6436   player67  0.994661  0.444211  0.572136  0.000000
4194   player65  0.995022  0.721113  0.584195  1.410826
42696  player67  0.995065  0.516103  0.918737  1.076328
51026  player67  0.995864  0.877335  0.516737  1.479127
14136  player67  0.997691  0.134021  0.913969  0.000000
47664  player65  0.998051  0.628051  0.722695  1.275357
55924  player65  0.998079  0.828749  0.151217  1.846863
18474  player65  0.998780  0.200990  0.098713  0.000000
41296  player67  0.998884  0.167139  0.504899  0.000000

[2000 rows x 5 columns]
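The same update can also be written with a boolean mask and .loc. This is a hedged alternative, not what the loop above does: calling update on a_df['Test'] is a form of chained assignment, and under pandas copy-on-write it may not write back into a_df. The three-row frame is a made-up stand-in for one entry of DataFrameDict, and the 0.01 threshold is the one used in this answer (the question's function uses 0.1).

```python
import pandas as pd

# Hypothetical stand-in for one DataFrame in DataFrameDict.
a_df = pd.DataFrame({'Names': ['player65', 'player67', 'player65'],
                     'Ob1': [0.90, 0.20, 0.50],
                     'Ob2': [0.80, 0.90, 0.10],
                     'Ob3': [0.30, 0.70, 0.10]})

diff = a_df['Ob3'] - a_df['Ob1']
mask = (a_df['Ob2'] > 0.5) & (diff < 0.01)  # threshold as in this answer

a_df['Test'] = 0.0                       # float column, so no dtype change later
a_df.loc[mask, 'Test'] = 1 - diff[mask]  # only the matching rows are written

print(a_df)
```

Writing through a_df.loc in a single step assigns directly on the frame itself, so there is no intermediate object whose changes could be lost.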

Answer 3: (score: 0)

Are lines 6 to 14 missing a level of indentation?

def points(row):
    val = 0
    if row['Ob2'] > 0.5:
        foo = row['Ob3'] - row['Ob1']
        if foo < 0.1:
            val = 1 - foo
    else:
        val = 0
    return val

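Answer 4: (score: 0)

Partly influenced by @andy's answer, thank you. This greatly reduces the run time compared with my Edit2 code.

I removed the function and did all of the calculation with np.where: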

import pandas as pd
import numpy as np
import datetime as date
import itertools

def random_dates(start, end, n, unit='D', seed=None):
    if not seed:
        np.random.seed(0)

    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start

print("Start: "+ str(date.datetime.now()))
print()

player_list = ['player' + str(x) for x in range(1,71)]

np.random.seed(0)
start = pd.to_datetime('2019-04-01')
end = pd.to_datetime('2019-04-10')

data = pd.DataFrame({'Names': player_list*1000,
                     'Dates': random_dates(start, end, 70000)})

#create list of unique pairs
comboNames = list(itertools.combinations(data.Names.unique(), 2))

#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in comboNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names.isin(key)]
    DataFrameDict[key] = DataFrameDict[key].sort_values(['Dates'])
    seconds = (DataFrameDict[key]['Dates'] - DataFrameDict[key]['Dates'].shift(1))/ np.timedelta64(1,'s')
    # uint16 rather than uint8: scores can reach 300, which does not fit in a uint8 (max 255)
    DataFrameDict[key]['Test'] = np.where((DataFrameDict[key]['Names'] != DataFrameDict[key]['Names'].shift(1))&\
                                              (np.logical_and(seconds>=1, seconds<=301)), 301-seconds,0).astype(np.uint16)

print("DF fill: "+ str(date.datetime.now()))
print()

headers = ['Player1','Player2','Score','Count']
summary = pd.DataFrame(([tbl[0], tbl[1], DataFrameDict[tbl]['Test'].sum(),
                      DataFrameDict[tbl]['Test'].astype(bool).sum(axis=0)] for tbl in DataFrameDict),
                      columns=headers).sort_values(['Score'], ascending=[False])

print("Fin: "+ str(date.datetime.now()))
print()
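To make the time-window logic easier to follow, here is a small sketch on four hand-written events for a single pair. The timestamps and the names pair, changed and in_window are invented for the example, and the 1 to 300 second window is assumed from range(1, 301) in the question's points function.

```python
import numpy as np
import pandas as pd

# Hypothetical events for one pair of players, already sorted by time.
pair = pd.DataFrame({
    'Names': ['player1', 'player2', 'player2', 'player1'],
    'Dates': pd.to_datetime(['2019-04-01 00:00:00',
                             '2019-04-01 00:00:10',
                             '2019-04-01 00:00:20',
                             '2019-04-01 00:10:20']),
})

# Seconds elapsed since the previous row (NaN for the first row).
seconds = pair['Dates'].diff().dt.total_seconds()

# True where the player differs from the one on the previous row.
changed = pair['Names'] != pair['Names'].shift(1)

# True where the gap falls inside the assumed 1..300 second window.
in_window = (seconds >= 1) & (seconds <= 300)

# np.uint16 is chosen because the largest score, 300, exceeds 255.
pair['Test'] = np.where(changed & in_window, 301 - seconds, 0).astype(np.uint16)

print(pair)
```

Only the second row scores here: it follows a different player by 10 seconds, so it gets 301 - 10. The third row is the same player again, and the fourth comes 600 seconds later, outside the window.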