Pandas: is there a faster way to update values?

Time: 2017-11-16 12:39:09

Tags: python pandas

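My table currently has more than 10,000,000 records and a column named ID. If the ID is in a given list, I want to update the column named '3rd_col' with a new value.

I am using .loc; here is my code: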

for _id in given_ids:
    df.loc[df.ID == _id, '3rd_col'] = new_value

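But the code above is very slow. How can I improve the performance of updating the values?

Sorry, I want to describe my problem more specifically: different IDs are assigned different values computed by functions, and there are about 4 columns to assign.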

for _id in given_ids:
    # pass the loop variable _id (not the builtin id) to each function
    df.loc[df.ID == _id, '3rd_col'] = return_new_val_1(_id)
    df.loc[df.ID == _id, '4rd_col'] = return_new_val_2(_id)
    df.loc[df.ID == _id, '5rd_col'] = return_new_val_3(_id)
    df.loc[df.ID == _id, '6rd_col'] = return_new_val_4(_id)

1 answer:

Answer 0 (score: 5)

You can create a dictionary first and then use replace:

import pandas as pd

# sample function
def return_new_val(x):
    return x * 3

given_ids = list('abc')

d = {_id: return_new_val(_id) for _id in given_ids}
print (d)
{'a': 'aaa', 'c': 'ccc', 'b': 'bbb'}

df = pd.DataFrame({'ID':list('abdefc'),
                   'M':[4,5,4,5,5,4]})


df['3rd_col'] = df['ID'].replace(d)
print (df)

  ID  M 3rd_col
0  a  4     aaa
1  b  5     bbb
2  d  4       d
3  e  5       e
4  f  5       f
5  c  4     ccc

Or use map, but then IDs without a match get NaN:

df['3rd_col'] = df['ID'].map(d)
print (df)

  ID  M 3rd_col
0  a  4     aaa
1  b  5     bbb
2  d  4     NaN
3  e  5     NaN
4  f  5     NaN
5  c  4     ccc
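
Both variants above build '3rd_col' from scratch. If the column already exists and rows whose ID is not in the list should keep their current value, one option is to combine map with fillna. This is only a minimal sketch: the 'old' placeholder values are made up for illustration, and it assumes the function never returns NaN on purpose, because such a result would also be replaced by the old value.

```python
import pandas as pd

# Hypothetical sample data: '3rd_col' already holds values that must survive
df = pd.DataFrame({'ID': list('abdefc'),
                   'M': [4, 5, 4, 5, 5, 4],
                   '3rd_col': ['old'] * 6})

# sample function, same as in the answer
def return_new_val(x):
    return x * 3

given_ids = list('abc')

# call the function once per given ID instead of once per row
d = {_id: return_new_val(_id) for _id in given_ids}

# map yields NaN for IDs missing from d; fillna puts the old value back there
df['3rd_col'] = df['ID'].map(d).fillna(df['3rd_col'])
print(df)
```

The whole column is updated in one vectorized pass, so the cost no longer grows with the number of IDs in the list the way the loop over .loc does.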

Edit:

If you need to add data from multiple functions, first create a new DataFrame and then join it to the original:

def return_new_val1(x):
    return x * 2

def return_new_val2(x):
    return x * 3


given_ids = list('abc')
df2 = pd.DataFrame({'ID':given_ids})
df2['3rd_col'] = df2['ID'].map(return_new_val1)
df2['4rd_col'] = df2['ID'].map(return_new_val2)
df2 = df2.set_index('ID')
print (df2)
   3rd_col 4rd_col
ID                
a       aa     aaa
b       bb     bbb
c       cc     ccc

df = pd.DataFrame({'ID':list('abdefc'),
                   'M':[4,5,4,5,5,4]})

df = df.join(df2, on='ID')
print (df)
  ID  M 3rd_col 4rd_col
0  a  4      aa     aaa
1  b  5      bb     bbb
2  d  4     NaN     NaN
3  e  5     NaN     NaN
4  f  5     NaN     NaN
5  c  4      cc     ccc

# but replace NaNs with the values in `ID`
cols = ['3rd_col','4rd_col']
df[cols] = df[cols].mask(df[cols].isnull(), df['ID'], axis=0)
print (df)
  ID  M 3rd_col 4rd_col
0  a  4      aa     aaa
1  b  5      bb     bbb
2  d  4       d       d
3  e  5       e       e
4  f  5       f       f
5  c  4      cc     ccc
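
The join above adds new columns. When the target columns already exist, as in the question, and only the rows with a listed ID should change, a boolean mask built once with isin may be closer to what is needed. The following is a sketch under assumptions: the starting values 'x' and 'y' and the funcs dictionary pairing each column with a function are hypothetical, not part of the original code.

```python
import pandas as pd

def return_new_val1(x):
    return x * 2

def return_new_val2(x):
    return x * 3

# Hypothetical sample data with existing values in the target columns
df = pd.DataFrame({'ID': list('abdefc'),
                   'M': [4, 5, 4, 5, 5, 4],
                   '3rd_col': ['x'] * 6,
                   '4rd_col': ['y'] * 6})
given_ids = list('abc')

# one function per target column (hypothetical pairing)
funcs = {'3rd_col': return_new_val1, '4rd_col': return_new_val2}

# build the boolean mask once instead of comparing df.ID for every single ID
mask = df['ID'].isin(given_ids)
ids = df.loc[mask, 'ID']

for col, func in funcs.items():
    # call each function once per given ID, then look the result up per row
    lookup = {_id: func(_id) for _id in given_ids}
    df.loc[mask, col] = ids.map(lookup)
print(df)
```

If every listed ID should receive the same value, as in the first version of the question, the right-hand side can simply be that scalar: df.loc[mask, '3rd_col'] = new_value.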