I have the following example:
import pandas as pd
import numpy as np
import time

def function(value, df):
    # Count the rows whose 'A' is lower than the given value.
    return len(df[df['A'] < value])

df = pd.DataFrame(np.random.randint(0, 100, size=(30000, 1)), columns=['A'])

# 1) List comprehension
start = time.time()
df['B'] = pd.Series([len(df[df['A'] < value]) for value in df['A']])
end = time.time()
print("list comprehension time:", end - start)

# 2) Series.apply
start = time.time()
df['B'] = df['A'].apply(function, df=df)
end = time.time()
print("apply time:", end - start)

# 3) Classic loop over rows
start = time.time()
series = []
for index, row in df.iterrows():
    series.append(len(df[df['A'] < row['A']]))
df['B'] = series
end = time.time()
print("loop time:", end - start)
Output:
time: 19.54859232902527
time: 23.598857402801514
time: 26.441001415252686
This example creates a new column by counting, for each row, the rows whose value in A is lower than that row's value.
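For this type of problem (where I create a new column after comparing each row against all the other rows of the dataframe), I have tried the apply function, a list comprehension, and a classic loop, but I think they are slow.
Is there a faster way?
P.S. A solution specific to this example is not what interests me most. I would prefer a general solution for this type of problem.
Another example: for a dataframe with one column of strings, create a new column by counting, for each row, the number of strings in the dataframe that start with the same first letter as that row's string.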
Answer 0 (score: 1)
I usually use numpy broadcasting for this type of task.
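The code that accompanied this answer did not survive, so what follows is only a minimal sketch of how a NumPy broadcasting approach could look for the question's example. The variable names (`a`, `counts`) and the reduced frame size are assumptions for illustration, not the answer author's code.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Smaller frame than in the question, to keep the comparison matrix small.
df = pd.DataFrame(np.random.randint(0, 100, size=(2000, 1)), columns=['A'])

a = df['A'].to_numpy()

# a[None, :] has shape (1, n) and a[:, None] has shape (n, 1), so the
# comparison broadcasts to an (n, n) boolean matrix whose entry [i, j]
# is True when a[j] < a[i]. Summing each row counts the values lower than a[i].
counts = (a[None, :] < a[:, None]).sum(axis=1)

df['B'] = counts
```

The trade-off is memory: the intermediate matrix holds n * n booleans, which for the 30000 rows in the question is roughly 900 MB. If that is too much, the same comparison can be done in row chunks.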
Answer 1 (score: 0)
In general, broadcasting as in Wen's solution is usually the fastest. In this case, though, it looks like rank can do the job.
np.random.seed(1)
df= pd.DataFrame(np.random.randint(0,100,size=(30000, 1)), columns=['A'])
%timeit df.A.rank()-1
2.71 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
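One caveat, offered as a sketch rather than a correction of the timing above: `rank()` defaults to `method='average'`, so tied values get fractional ranks and `rank() - 1` will not equal the count of strictly lower values when `A` contains duplicates (which it does here, with 30000 values drawn from 0 to 99). Passing `method='min'` should reproduce the counts from the question, and `np.searchsorted` on the sorted values is another option. The names `via_rank` and `via_sorted` are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 100, size=(3000, 1)), columns=['A'])

# method='min' gives every tied value the lowest rank of its group,
# so rank - 1 is the number of strictly lower values.
via_rank = (df['A'].rank(method='min') - 1).astype(int)

# Alternative: the left insertion position of each value in the sorted
# array is also the number of strictly lower values.
a = df['A'].to_numpy()
via_sorted = np.searchsorted(np.sort(a), a, side='left')

df['B'] = via_rank
```

Both variants avoid the n * n intermediate that broadcasting needs, but they only apply when the comparison can be expressed through ordering, so they are less general than broadcasting.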