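I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine several columns into a new column. I know how to do this, but I would like to know which approach is faster and more efficient. Here is my code: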
import pandas as pd
import numpy as np
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
This is what I want to achieve:
df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']
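Another option is to use pandas' apply functionality:
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
I want to know which method takes less time when we have 1.5 million rows and must combine 8 columns.
Answer 0 (score: 3)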
The first method is faster because it is vectorized:
df=pd.DataFrame(columns=['A','B','C'],data=[[1,2,3],[4,5,6],[7,8,9]])
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
#[30000 rows x 3 columns]
df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
#similar timings with mul function
#df['D1']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
print (df)
In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop
In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop
In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
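The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The helper names vectorized and row_wise and the reduced row count are assumptions made for illustration, and absolute timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Same 3-row frame as in the answer, repeated to 3,000 rows
# (smaller than the 30,000 rows above so the script finishes quickly).
base = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
small = pd.concat([base] * 1000).reset_index(drop=True)


def vectorized(df):
    # Whole-column arithmetic: each operation runs over the full column at once.
    return 0.5 * df['A'] + 0.3 * df['B'] + 0.2 * df['C']


def row_wise(df):
    # Calls the Python lambda once per row.
    return df.apply(lambda row: 0.5 * row['A'] + 0.3 * row['B'] + 0.2 * row['C'], axis=1)


runs = 5
t_vec = timeit.timeit(lambda: vectorized(small), number=runs)
t_row = timeit.timeit(lambda: row_wise(small), number=runs)
print(f"vectorized: {t_vec / runs:.6f} s per call")
print(f"apply:      {t_row / runs:.6f} s per call")
```

Both helpers return the same values, so only the elapsed time differs.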
Another test on a DataFrame with 1.5M rows, where the apply method is very slow:
#[1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)
In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop
In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop
In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
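The question mentions combining 8 columns. With that many terms, the weighted sum can also be written as a single matrix-vector product instead of spelling out each column. This is only a sketch: the column names, the weights, and the frame df8 are hypothetical and were not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and weights, chosen only for illustration.
cols = list("ABCDEFGH")
weights = np.array([0.30, 0.20, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05])

# Random stand-in data; the real frame would have 1.5 million rows.
rng = np.random.default_rng(0)
df8 = pd.DataFrame(rng.random((1000, len(cols))), columns=cols)

# Vectorized weighted sum: one matrix-vector product over all rows.
df8["combined"] = df8[cols].to_numpy() @ weights

# The same result written out column by column, as in the answer above.
expected = sum(w * df8[c] for c, w in zip(cols, weights))
assert np.allclose(df8["combined"], expected)
```

Like the column-by-column expression, this stays vectorized, so it avoids the per-row Python overhead of apply.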
Answer 1 (score: 3)