s = []
for idx2, source in df.iterrows():
    num_flux = len(df[df['flux_radio'] > source['flux_match']])
    surface_density = num_flux / area
    s.append(1 - np.exp(-1 * np.pi * source['Separation']**2 * surface_density))
df['s'] = s
I am trying to vectorize this for loop. The DataFrame looks like this:
| | flux_match | Separation | flux_radio |
|---|---|---|---|
| ... | ... | ... | ... |
| 5000 | 22.2999 | 2.0229 | 11.8 |
| 5001 | 33.2999 | 3.3546 | 22.3 |
| 5002 | 44.2999 | 4.08002 | 13.7 |
| 5003 | 17.4001 | 3.4419 | 13.6 |
| 5004 | 53.7999 | 4.3195 | 18.9 |
| ... | ... | ... | ... |
For each 'flux_match', we want to count how many 'flux_radio' values are greater, and use that count to compute a statistic.
I tried:
def func(radio, match, distance, area=6228*5940):
    num_flux = len(radio > match)
    print(num_flux)
    surface_density = num_flux / area
    s = 1 - np.exp(-1 * np.pi * distance**2 * surface_density)
    return s

df['s'] = func(
    df['flux_radio'].values,
    df['flux_match'].values,
    df['Separation'].values
)
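But this gives the wrong values, because it computes 'num_flux' only once. We want 'num_flux' for each 'flux_match'. Since the data is large, any suggestions for a faster way to do this would be appreciated.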
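To make the failure concrete, here is a minimal sketch using the five sample rows from the table above (the variable names `wrong` and `expected` are only for illustration). It shows that `radio > match` compares row i with row i only, and that `len()` returns the array length, not the number of `True` entries:

```python
import numpy as np

# Sample values copied from the five rows shown in the table.
radio = np.array([11.8, 22.3, 13.7, 13.6, 18.9])
match = np.array([22.2999, 33.2999, 44.2999, 17.4001, 53.7999])

# Elementwise comparison pairs row i with row i, and len() just
# returns the length of the resulting boolean array.
wrong = len(radio > match)

# What the original loop computes: for each match value,
# the number of radio values strictly greater than it.
expected = np.array([(radio > m).sum() for m in match])

print(wrong)     # a single number, the same for every row
print(expected)  # one count per row
```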
Answer 0 (score: 0)
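Here is one solution:
Since you need to compare each `flux_match` against every `flux_radio`, let's first assign each row the complete collection of `flux_radio` values. Then `df.explode` expands the DataFrame for the later operations: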
Data
import numpy as np
import pandas as pd

data = np.array([[22.2999, 2.0229, 11.8],
                 [33.2999, 3.3546, 22.3],
                 [44.2999, 4.08002, 13.7],
                 [17.4001, 3.4419, 13.6],
                 [53.7999, 4.3195, 18.9]])
df = pd.DataFrame(data=data, columns=['flux_match', 'Separation', 'flux_radio'])
area = 6228 * 5940
Result from your code
flux_match Separation flux_radio s
0 22.2999 2.02290 11.8 3.475070e-07
1 33.2999 3.35460 22.3 0.000000e+00
2 44.2999 4.08002 13.7 0.000000e+00
3 17.4001 3.44190 13.6 2.012060e-06
4 53.7999 4.31950 18.9 0.000000e+00
My solution:
# Assign each row the complete collection of 'flux_radio' values
df["all_flux_radio"] = np.tile(df['flux_radio'].to_numpy(), (df.shape[0], 1)).tolist()
df_exp = df.explode('all_flux_radio')
# True where a radio flux exceeds that row's 'flux_match'
df_exp['indicator'] = df_exp['all_flux_radio'] > df_exp['flux_match']
# Compute density (grouping by 'flux_match' assumes its values are unique)
s = df_exp.groupby('flux_match').agg(surface_density=("indicator", "sum")) / area
# Merge density back to the df
df = pd.concat([df.set_index('flux_match'), s], axis=1)
df['s'] = 1 - np.exp(-1 * np.pi * df['Separation']**2 * df['surface_density'])
# Clean up helper columns (assign the result, since drop() is not in place)
df = df.reset_index().drop(columns=["all_flux_radio", "surface_density"])
In terms of `s`, this code gives me the same results as your first snippet (the rows come out sorted by `flux_match`):
flux_match Separation flux_radio s
0 17.4001 3.44190 13.6 2.012060e-06
1 22.2999 2.02290 11.8 3.475070e-07
2 33.2999 3.35460 22.3 0.000000e+00
3 44.2999 4.08002 13.7 0.000000e+00
4 53.7999 4.31950 18.9 0.000000e+00
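One caveat for large data: the `np.tile` / `explode` step materializes every (`flux_match`, `flux_radio`) pair, so memory grows with the square of the row count. As a possible alternative, here is a minimal sketch that sorts `flux_radio` once and uses `np.searchsorted` to count the values above each `flux_match`. The helper name `add_s_sorted` is hypothetical, and the sketch assumes the columns contain no NaN values:

```python
import numpy as np
import pandas as pd

def add_s_sorted(df, area):
    """Hypothetical helper: count flux_radio values above each flux_match by sorting."""
    radio_sorted = np.sort(df["flux_radio"].to_numpy())
    # side="right" returns how many radio values are <= match,
    # so the remainder is the number strictly greater than match.
    n_le = np.searchsorted(radio_sorted, df["flux_match"].to_numpy(), side="right")
    num_flux = radio_sorted.size - n_le
    surface_density = num_flux / area
    out = df.copy()
    out["s"] = 1 - np.exp(-np.pi * out["Separation"].to_numpy() ** 2 * surface_density)
    return out

# Same sample data as in the answer above.
data = np.array([[22.2999, 2.0229, 11.8],
                 [33.2999, 3.3546, 22.3],
                 [44.2999, 4.08002, 13.7],
                 [17.4001, 3.4419, 13.6],
                 [53.7999, 4.3195, 18.9]])
df = pd.DataFrame(data=data, columns=["flux_match", "Separation", "flux_radio"])
result = add_s_sorted(df, area=6228 * 5940)
print(result)
```

This keeps the original row order and does not rely on `flux_match` values being unique, since nothing is grouped or re-indexed.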