Speeding up an Excel-style VLOOKUP operation in Python with pandas

Posted: 2014-02-22 23:19:25

Tags: python performance pandas vlookup

I have written some code that essentially performs an Excel-style VLOOKUP across two pandas dataframes, and I would like to speed it up.

The dataframes are structured as follows:

dbase1_df.columns:
    'VALUE','COUNT','GRID','SGO10GEO'

merged_df.columns:
    'GRID','ST0','ST1','ST2','ST3','ST4','ST5','ST6','ST7','ST8','ST9','ST10'

sgo_df.columns:
    'mkey','type'
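
For illustration, tiny made-up versions of these frames might look like this (the values are invented and only a couple of the ST0..ST10 columns are shown; only the column layout follows the description above):

    import pandas

    # Toy frames with the same column layout as described above (values are made up)
    dbase1_df = pandas.DataFrame({'VALUE': [10, 20],
                                  'COUNT': [3, 7],
                                  'GRID': [101, 102],
                                  'SGO10GEO': [555, 556]})
    sgo_df = pandas.DataFrame({'mkey': [555, 556],
                               'type': ['4', '7']})
    merged_df = pandas.DataFrame({'GRID': [101, 102],
                                  'ST4': [1.1, 1.2],
                                  'ST7': [2.1, 2.2]})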

To merge them, I do the following:

1. For each row in dbase1_df, find the row in sgo_df whose 'mkey' value matches the row's 'SGO10GEO' value. Take the 'type' from that row of sgo_df.

2. 'type' holds an integer between 0 and 10. Build a column name by joining 'ST' and the type.

3. Look up the value in merged_df whose 'GRID' value matches the 'GRID' value in dbase1_df and whose column name is the one built in step 2. Output this value to a csv file.

    # Read the dbase1 dbf into a dataframe
    dbase1_df = pandas.DataFrame.from_csv(dbase1_file, index_col=False)
    merged_df = pandas.DataFrame.from_csv('merged.csv', index_col=False)

    lup_out.writerow(["VALUE", "TYPE", EXTRACT_VAR.upper()])
    # For each unique value in the dbase1 dataframe:
    for index, row in dbase1_df.iterrows():

        # 1. Find the soil type corresponding to the mkey
        tmp = sgo_df.type.values[sgo_df['mkey'] == int(row['SGO10GEO'])]
        if tmp.size > 0:
            s_type = 'ST' + tmp[0]
            val = int(row['VALUE'])

            # 2. Obtain hmu value
            tmp_val = merged_df[s_type].values[merged_df['GRID'] == int(row['GRID'])]
            if tmp_val.size > 0:
                hmu_val = tmp_val[0]
                # 3. Output: VALUE, type, hmu value
                lup_out.writerow([val, s_type, hmu_val])
            else:
                err_out.writerow([merged_df['GRID'], s_type, row['GRID']])


Is there anything here that might be the speed bottleneck? It currently takes around 20 minutes for the ~500,000 rows in dbase1_df; merged_df has about 1,000 rows and sgo_df about 500,000 rows.

Thanks!

1 answer:

Answer 0 (score: 3)

You need to use merge operations in pandas to get better performance. I can't test the code below because I don't have your data, but it should at least help you understand the approach:

import pandas as pd

dbase1_df = pd.DataFrame.from_csv('dbase1_file.csv',index_col=False)
sgo_df = pd.DataFrame.from_csv('sgo_df.csv',index_col=False)
merged_df = pd.DataFrame.from_csv('merged_df.csv',index_col=False)

# You need to use the same column names for the common columns to be able to do the merge operation in pandas, so we change the column name to 'mkey'

dbase1_df.columns = [u'VALUE', u'COUNT', u'GRID', u'mkey']

# The operation below merges the two dataframes
Step1_Merge = pd.merge(dbase1_df,sgo_df)

# We need to add a new column that concatenates 'ST' and type
Step1_Merge['type_2'] = Step1_Merge['type'].map(lambda x: 'ST'+str(x))

# We need to change the shape of merged_df and move columns to rows to be able to do another merge operation
id = merged_df.ix[:,['GRID']]
a = pd.merge(merged_df.stack(0).reset_index(1), id, left_index=True, right_index=True)

# We also need to change the automatically generated name to type_2 to be able to do the next merge operation
a.columns = [u'type_2', 0, u'GRID']


# Final merge: for each (GRID, type_2) pair this picks up the value the original loop looked up in merged_df
result = pd.merge(Step1_Merge,a,on=[u'type_2',u'GRID'])
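
From there, assuming the merges succeed, the looked-up value sits in the column produced by stack() (named 0 above), so reproducing the csv output of the original loop could look roughly like this (the output column names and file name below are only placeholders):

# Keep only the columns the original loop wrote out and save them to csv
out = result[['VALUE', 'type_2', 0]]
out.columns = ['VALUE', 'TYPE', 'HMU_VAL']  # HMU_VAL is a placeholder name
out.to_csv('lookup_output.csv', index=False)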