pandas - Creating binary data from existing data

Time: 2018-09-04 09:50:56

Tags: python pandas

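I am trying to create binary data from an existing DataFrame, but it takes a very long time to finish. Is there a faster way to do this?

What I have now is a DataFrame with multiple rows, for example df: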

Index   Actions Tries   Ratio
0       20      200     0,1
1       10      400     0,025
2       15      500     0,03
3       30      700     0,04
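
For anyone who wants to reproduce the example, the frame above could be built roughly like this. This is only a sketch: it assumes the commas in the Ratio column are decimal separators, and it computes Ratio from the other two columns instead of typing in the rounded values shown in the table.

```python
import pandas as pd

# Sample frame mirroring the table above.
df = pd.DataFrame({
    "Actions": [20, 10, 15, 30],
    "Tries": [200, 400, 500, 700],
})

# Assumption: Ratio is simply Actions / Tries, stored as a float.
df["Ratio"] = df["Actions"] / df["Tries"]

print(df)
```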

I now want to convert this data into binary data, for example df_binary:

Index_old   Index_new   Actions Tries   Ratio   Success
0           0           20      200     0,1     1
0           1           20      200     0,1     1
0           2           20      200     0,1     1
0           3           20      200     0,1     1
...     
0           19          20      200     0,1     1  -> 20 times success(1)   
0           20          20      200     0,1     0
0           21          20      200     0,1     0
0           22          20      200     0,1     0
...                 
0           199         20      200     0,1     0  -> 200-20= 180 times fail(0)
1           200         10      400     0,025   1
1           201         10      400     0,025   1
1           202         10      400     0,025   1

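As the example above shows, Actions / Tries = Ratio. The number of times each row is replicated is given by Tries, the number of copies with Success = 1 is given by Actions, and the number of copies with Success = 0 is given by Tries - Actions.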

import pandas as pd
# create the new DataFrame
df_binary = pd.DataFrame()
# iterate over all rows in the original DataFrame (df)
for index, row in df.iterrows():
    # get the number of tries from the row in the df
    # (iterrows upcasts mixed int/float rows to float, so cast back to int for range())
    tries = int(row['Tries'])
    # get the number of actions from the row in the df
    actions = int(row['Actions'])
    # calculate the number of times the tries did not result in action
    noActions = tries - actions
    # create a temporary copy of the row used for appending
    tempDf = row.copy()

    # loop for the range given by tries (row['Tries']), e.g. loop 200 times
    # ("try" is a reserved keyword in Python, so the loop variable is named attempt)
    for attempt in range(tries):
        if attempt < actions:
            # the first `actions` attempts are successes, e.g. attempt 1 < 20, attempt 15 < 20
            tempDf['Success'] = 1
        else:
            # the remaining attempts are failures, e.g. attempt 25 >= 20, attempt 180 >= 20
            tempDf['Success'] = 0
        # append the new row to df_binary; DataFrame.append was removed in pandas 2.0,
        # so pd.concat is used here instead
        df_binary = pd.concat([df_binary, tempDf.to_frame().T], ignore_index=True)

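In this example it does not take that long to finish. But my real df_binary should contain about 15 million rows and many more columns once it is done, and that takes a very long time to complete.

Is there any way to do this faster?

Thanks!

1 Answer:

Answer 0: (score: 1)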

Here is one potential approach using pandas.concat, Series.repeat and DataFrame.assign together with a list comprehension:

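import numpy as np
import pandas as pd

# 1 for the first Actions copies of each row, 0 for the remaining Tries - Actions copies
successes = np.concatenate([[1]*a + [0]*(t-a) for a, t in zip(df['Actions'], df['Tries'])])

# repeat every column Tries times, glue the columns back together, then attach the flags;
# reset_index() keeps the old row label in a column named 'index'
df_binary = (pd.concat([df[s].repeat(df['Tries']) for s in df], axis=1)
             .assign(success=successes).reset_index())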
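
A variation on the same idea, offered only as a sketch: repeat whole rows through the index instead of column by column, then derive the flag from each copy's position within its original row. The sample frame is rebuilt here so the snippet runs on its own, and the names Index_old, Index_new and Success are taken from the desired output in the question. Whether this is faster on 15 million rows has not been measured here.

```python
import pandas as pd

# Sample frame from the question (Ratio computed rather than typed in).
df = pd.DataFrame({"Actions": [20, 10, 15, 30], "Tries": [200, 400, 500, 700]})
df["Ratio"] = df["Actions"] / df["Tries"]

# Repeat each original row Tries times; the repeated index keeps the old row label.
df_binary = df.loc[df.index.repeat(df["Tries"])].copy()

# Position of each copy within its original row: 0, 1, ..., Tries - 1.
position = df_binary.groupby(level=0).cumcount()

# The first Actions copies of each row are successes (1), the rest failures (0).
df_binary["Success"] = (
    position.to_numpy() < df_binary["Actions"].to_numpy()
).astype(int)

# Move the old row label into a column and name the new running index.
df_binary = (df_binary.rename_axis("Index_old")
             .reset_index()
             .rename_axis("Index_new"))

print(df_binary.head())
```

For the sample data, a quick sanity check is that len(df_binary) equals df['Tries'].sum() and df_binary['Success'].sum() equals df['Actions'].sum().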