Find the nearest column value in each row - Pandas

Time: 2020-09-04 00:38:55

Tags: python python-3.x pandas

Below is a sample of a larger dataset:

df_old = pd.DataFrame({'code': ['fea-1','fea-132','fea-223','fea-394','fea-595','fea-130','fea-495'],
                   'forecastWind_low':[20,15,0,45,45,25,45],
                   'forecastWind_high':['NaN' ,30,'NaN',55,65,35,'NaN'],
                   'obs_windSpeed':[20,11,3,65,55,'NaN',55]})

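I have forecast wind speeds that I need to compare against observations. Ultimately, for each row I need to find the forecast speed (low or high) that is closest to the observed wind speed, to get output like this: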

df_new = pd.DataFrame({'code': ['fea-1','fea-132','fea-223','fea-394','fea-595','fea-130','fea-495'],
                   'forecastWind_low':[20,15,0,45,45,25,45],
                   'forecastWind_high':['NaN' ,30,'NaN',55,65,35,'NaN'],
                   'obs_windSpeed':[20,11,3,65,55,'NaN',55],
                   'nearest_forecast_windSpeed':[20,15,0,55,45,'NaN',45]})

3 answers:

Answer 0: (score: 1)

Write a custom comparison function and apply it to each row:

import numpy as np

# high, low and obs must be numeric (float NaN for missing values)
def check_speed_diff(high,low,obs):
    if np.isnan(obs):
        return np.nan
    elif np.isnan(high):
        return low
    elif np.isnan(low):
        return high
    
    if abs(high-obs)<abs(low-obs):
        return high
    else:
        return low

df_old.apply(lambda x: 
    check_speed_diff(
        x.forecastWind_high,
        x.forecastWind_low,
        x.obs_windSpeed
    ),
    axis=1
)
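As a usage sketch for the function above, applied to the question's df_old: that frame stores missing values as the string 'NaN', and np.isnan raises a TypeError on strings, so this example first converts the three numeric columns with pd.to_numeric(..., errors='coerce'). The conversion step, the `num_cols` name, and the assignment to a new column are assumptions added here, not part of the original answer.

```python
import numpy as np
import pandas as pd

df_old = pd.DataFrame({
    'code': ['fea-1', 'fea-132', 'fea-223', 'fea-394', 'fea-595', 'fea-130', 'fea-495'],
    'forecastWind_low': [20, 15, 0, 45, 45, 25, 45],
    'forecastWind_high': ['NaN', 30, 'NaN', 55, 65, 35, 'NaN'],
    'obs_windSpeed': [20, 11, 3, 65, 55, 'NaN', 55],
})

# Assumption: turn the 'NaN' strings into real float NaN before comparing
num_cols = ['forecastWind_low', 'forecastWind_high', 'obs_windSpeed']
df_old[num_cols] = df_old[num_cols].apply(pd.to_numeric, errors='coerce')


def check_speed_diff(high, low, obs):
    # No observation: nothing to compare against
    if np.isnan(obs):
        return np.nan
    # Only one forecast available: return the other one
    if np.isnan(high):
        return low
    if np.isnan(low):
        return high
    # Both available: pick the closer one (ties go to the low forecast)
    return high if abs(high - obs) < abs(low - obs) else low


# Store the row-wise result in the column the question asks for
df_old['nearest_forecast_windSpeed'] = df_old.apply(
    lambda x: check_speed_diff(x.forecastWind_high, x.forecastWind_low, x.obs_windSpeed),
    axis=1,
)
```

Row-wise apply is easy to read but calls the Python function once per row, so it may be slow on a much larger dataset.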

Answer 1: (score: 1)

Here is another way to achieve what you are looking for. It is not limited to comparing two columns.

# df is the question's DataFrame (df_old above); requires pandas as pd and numpy as np
col = ['forecastWind_low','forecastWind_high']
comparecol = ['obs_windSpeed']
df[col + comparecol] = df[col + comparecol].astype(float)
dfmerge =pd.merge(df[col].stack().reset_index(-1),df[comparecol],left_index=True,right_index=True,how='left')
dfmerge = dfmerge.rename(columns = {'level_1':'windforecast',0:'Amount'})
dfmerge['difference'] = abs(dfmerge['obs_windSpeed'] - dfmerge['Amount'])
# stable sort: on ties the first stacked column (forecastWind_low) stays first
dfmerge = dfmerge.sort_values(by='difference',ascending=True,kind='mergesort')
dfmerge = dfmerge.groupby(level=0).head(1)
df = pd.merge(df,dfmerge['Amount'],left_index=True,right_index=True,how='left')
df.loc[df['obs_windSpeed'].isna(),'Amount'] = np.nan
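For comparison, the same "closest of several columns" idea can also be sketched with NumPy broadcasting instead of stack/merge/groupby. This is a different technique from the answer above, not the answer's own method; the function name `nearest_forecast` and its parameters are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd


def nearest_forecast(df, forecast_cols, obs_col):
    """Return, per row, the forecast value closest to the observation."""
    # Coerce to float so that 'NaN' strings become real NaN
    forecasts = (
        df[forecast_cols].apply(pd.to_numeric, errors='coerce').to_numpy(dtype=float)
    )
    obs = pd.to_numeric(df[obs_col], errors='coerce').to_numpy(dtype=float)

    # Absolute distance of every forecast column to the observation
    diff = np.abs(forecasts - obs[:, None])
    # Missing forecasts must never win, so give them infinite distance
    diff = np.where(np.isnan(diff), np.inf, diff)

    # Index of the closest column per row (ties go to the first column)
    best = diff.argmin(axis=1)
    nearest = forecasts[np.arange(len(df)), best]

    # Without an observation there is no nearest value
    nearest[np.isnan(obs)] = np.nan
    return pd.Series(nearest, index=df.index)


df_old = pd.DataFrame({
    'code': ['fea-1', 'fea-132', 'fea-223', 'fea-394', 'fea-595', 'fea-130', 'fea-495'],
    'forecastWind_low': [20, 15, 0, 45, 45, 25, 45],
    'forecastWind_high': ['NaN', 30, 'NaN', 55, 65, 35, 'NaN'],
    'obs_windSpeed': [20, 11, 3, 65, 55, 'NaN', 55],
})

df_old['nearest_forecast_windSpeed'] = nearest_forecast(
    df_old, ['forecastWind_low', 'forecastWind_high'], 'obs_windSpeed'
)
```

Passing a longer list as `forecast_cols` extends the comparison to more than two forecast columns without changing the function.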

Answer 2: (score: 1)

After modifying Jeff's solution, I managed to come up with this:

def check_speed_diff(high,low,obs):
    # Missing values are expected as the literal string 'NaN'
    if obs == 'NaN':
        return np.nan
    if low != 'NaN' and high == 'NaN':
        return low
    if low == 'NaN' and high != 'NaN':
        return high
    if low != 'NaN' and high != 'NaN':
        if abs(high-obs)<abs(low-obs):
            return high
        else:
            return low

Another problem I ran into was that some columns/rows contained strings other than 'NaN', so I used pandas and coerced the errors:

df.forecast_WindSpeed_high = pd.to_numeric(df.forecast_WindSpeed_high,errors='coerce')
df.forecast_WindSpeed_low = pd.to_numeric(df.forecast_WindSpeed_low ,errors='coerce')

I then applied the function to each row, as Jeff suggested.
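To illustrate the coercion step, here is a small sketch of what pd.to_numeric(..., errors='coerce') does to a messy column. The sample values (including the string 'calm') are hypothetical and only stand in for the non-numeric strings the answer mentions.

```python
import pandas as pd

# Hypothetical messy column: a number, the string 'NaN', other text, a numeric string
raw = pd.Series([20, 'NaN', 'calm', '35'])

# Anything that cannot be parsed as a number becomes float NaN
clean = pd.to_numeric(raw, errors='coerce')

# After coercion, missing values are float NaN, so detect them with isna()
missing = clean.isna()
```

Note that once a column has been coerced, its missing entries are float NaN rather than the string 'NaN', so a check such as `obs == 'NaN'` no longer matches them; `pd.isna` or `np.isnan` is the safer test at that point.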

It may not be the most efficient approach, but it got the job done... Thanks, everyone, for the help.