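I am trying to create binary data from an existing DataFrame, but it takes a very long time to complete. Is there a faster way to do this?
What I have now is a DataFrame with multiple rows, for example df: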
Index  Actions  Tries  Ratio
0      20       200    0,1
1      10       400    0,025
2      15       500    0,03
3      30       700    0,04
I now want to convert this data into binary data, for example df_binary:
Index_old  Index_new  Actions  Tries  Ratio  Success
0          0          20       200    0,1    1
0          1          20       200    0,1    1
0          2          20       200    0,1    1
0          3          20       200    0,1    1
...
0          19         20       200    0,1    1        -> 20 times success (1)
0          20         20       200    0,1    0
0          21         20       200    0,1    0
0          22         20       200    0,1    0
...
0          199        20       200    0,1    0        -> 200 - 20 = 180 times fail (0)
1          200        10       400    0,025  1
1          201        10       400    0,025  1
1          202        10       400    0,025  1
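As the example above shows, Actions / Tries = Ratio. The number of times each row is replicated is given by Tries, the number of rows with Success = 1 is given by Actions, and the number of rows with Success = 0 is given by Tries - Actions.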
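To make the rule concrete, here is a minimal sketch for a single row. The helper name success_flags is hypothetical and only serves the illustration; it assumes 0 <= Actions <= Tries and that both values are integers.

```python
import numpy as np


def success_flags(actions: int, tries: int) -> np.ndarray:
    """Return `tries` flags: the first `actions` are 1, the remaining ones 0.

    Hypothetical helper, not part of pandas or NumPy.
    """
    if not 0 <= actions <= tries:
        raise ValueError("expected 0 <= actions <= tries")
    # Positions 0 .. tries-1; a position counts as a success while it is below `actions`.
    return (np.arange(tries) < actions).astype(int)


# First row of the example: Actions = 20, Tries = 200.
flags = success_flags(20, 200)
```

For the first row this yields 200 flags, of which the first 20 are 1 and the remaining 180 are 0.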
import pandas as pd

# create the new DataFrame
df_binary = pd.DataFrame()

# iterate over all rows in the original DataFrame (df)
for index, row in df.iterrows():
    # get the number of tries from the row in the df
    tries = int(row['Tries'])
    # get the number of actions from the row in the df
    actions = int(row['Actions'])
    # calculate the number of times the tries did not result in an action
    no_actions = tries - actions
    # create a temporary copy of the row, used for appending
    temp_row = row.copy()
    # loop over the range given by tries (row['Tries']), e.g. loop 200 times
    for attempt in range(tries):
        if attempt < actions:
            # while the attempt number is below actions, set Success to 1, e.g. attempt 1 < 20, attempt 15 < 20
            temp_row['Success'] = 1
        else:
            # otherwise set Success to 0 (failure), e.g. attempt 25 >= 20, attempt 180 >= 20
            temp_row['Success'] = 0
        # append the new row to df_binary
        df_binary = pd.concat([df_binary, temp_row.to_frame().T], ignore_index=True)
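With this example, it does not take long to finish. But my actual df_binary should contain about 15 million rows and many more columns once it is complete, and building it takes a very long time.
Is there a way to do this faster?
Thanks!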
Answer 0 (score: 1)
Here is one potential approach using pandas.concat, Series.repeat and DataFrame.assign, together with a list comprehension:
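import numpy as np
import pandas as pd

# for every original row, build Actions ones followed by (Tries - Actions) zeros
successes = np.concatenate([[1]*a + [0]*(t-a) for a, t in zip(df['Actions'], df['Tries'])])
# repeat every column Tries times, glue the columns back together and add the Success flags
df_binary = (pd.concat([df[s].repeat(df['Tries']) for s in df], axis=1)
             .assign(Success=successes).reset_index())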
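As a possible alternative, the sketch below avoids the Python-level list comprehension by repeating the index labels and deriving the flags from each row's position within its group. It is only a sketch under assumptions: the frame is rebuilt here from the question's example (with Ratio written using a decimal point), the original index is unique, and the column name Index_old is borrowed from the desired output in the question.

```python
import pandas as pd

# Example frame from the question (Ratio values copied, using a decimal point).
df = pd.DataFrame({
    "Actions": [20, 10, 15, 30],
    "Tries": [200, 400, 500, 700],
    "Ratio": [0.1, 0.025, 0.03, 0.04],
})

# Repeat every index label Tries times and select those rows in one step.
df_binary = (
    df.loc[df.index.repeat(df["Tries"])]
    .rename_axis("Index_old")  # keep the original row label as a column
    .reset_index()
)

# Position of each expanded row within its original row: 0, 1, ..., Tries - 1.
position = df_binary.groupby("Index_old").cumcount()

# The first Actions positions of every group count as successes, the rest as failures.
df_binary["Success"] = (position < df_binary["Actions"]).astype(int)
```

A quick sanity check is that df_binary.groupby("Index_old")["Success"].sum() reproduces the original Actions column and that len(df_binary) equals df["Tries"].sum().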