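Creating a new column from another column in Python

Asked: 2019-05-14 08:55:53

Tags: python pandas

I have a pandas DataFrame in Python; let's call it df.

In this DataFrame, I create a new column based on an existing column, as follows: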

df.loc[:, 'new_col'] = df['col']

Then I do the following:

df[df['new_col']=='Above Average'] = 'Good'

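However, I noticed that this operation also changes the values in df['col'].

What should I do so that the values in df['col'] are not affected by operations performed on df['new_col']?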

2 Answers:

Answer 0: (score: 2)

Use DataFrame.loc with boolean indexing:

df.loc[df['new_col']=='Above Average', 'new_col'] = 'Good'

If no column is specified, every column is set to Good in the rows that match the condition.
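The difference can be sketched on a small made-up frame (the sample values below are assumptions for illustration, not the asker's data). Indexing with only a boolean mask assigns to all columns of the matching rows, while passing a column label to .loc limits the assignment to that column:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col': ['Above Average', 'Below Average', 'Average']})
df['new_col'] = df['col']

# Mask only: every column in the matching rows is overwritten
wrong = df.copy()
wrong[wrong['new_col'] == 'Above Average'] = 'Good'

# Mask plus column label: only new_col is overwritten
right = df.copy()
right.loc[right['new_col'] == 'Above Average', 'new_col'] = 'Good'

print(wrong)
print(right)
```

In `wrong`, both col and new_col change in the matching row; in `right`, col keeps its original value.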


Alternatively, both lines of code can be replaced by a single line using numpy.where or Series.mask:

import numpy as np

# the condition is evaluated on the original column, so new_col need not exist yet
df['new_col'] = np.where(df['col']=='Above Average', 'Good', df['col'])

df['new_col'] = df['col'].mask(df['col']=='Above Average', 'Good')
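As a rough check, the two one-liners can be compared on a made-up frame (the sample values and the names `is_above`, `via_where` and `via_mask` are assumptions for illustration). Both build the new values from df['col'] without writing to it:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col': ['Above Average', 'Below Average', 'Average']})
is_above = df['col'] == 'Above Average'

# numpy.where returns an array: 'Good' where the mask is True, else the original value
via_where = np.where(is_above, 'Good', df['col'])

# Series.mask returns a Series: values are replaced where the mask is True
via_mask = df['col'].mask(is_above, 'Good')

df['new_col'] = via_mask
print(df)
```

Either result can be assigned to df['new_col']; df['col'] is only read, never modified.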

Edit: to change many values, use Series.replace or Series.map with a dictionary that specifies the values:

d = {'Good':['Above average','effective'], 'Very Good':['Really effective']}

#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'Above average': 'Good', 'effective': 'Good', 'Really effective': 'Very Good'}

df['new_col'] = df['col'].replace(d1)
# on large data, map + fillna usually performs better than replace
df['new_col'] = df['col'].map(d1).fillna(df['col'])
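A minimal sketch of the two dictionary-based variants side by side, on made-up values (the sample column, including the unmapped value 'Poor', is an assumption for illustration). Values without a key in the dictionary are kept by replace; with map they become NaN first, which is why fillna(df['col']) is needed:

```python
import pandas as pd

# Hypothetical sample data; 'Poor' has no entry in the dictionary
df = pd.DataFrame({'col': ['Above average', 'effective', 'Really effective', 'Poor']})

d = {'Good': ['Above average', 'effective'], 'Very Good': ['Really effective']}
# Invert the dict: each old value points to its new label
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

# replace leaves unmapped values untouched
replaced = df['col'].replace(d1)

# map turns unmapped values into NaN, so fill them back from the original column
mapped = df['col'].map(d1).fillna(df['col'])

print(replaced.tolist())
print(mapped.tolist())
```

The lookup is case-sensitive, so a value such as 'Above Average' would not match the key 'Above average'.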

Answer 1: (score: 0)

Another option is to use the pandas where method:

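# where keeps values where the condition is True and uses other elsewhere;
# assigning the result avoids relying on inplace=True on a selected column
df['new_col'] = df['col'].where(df['col']!='Above Average', other='Good')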
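Series.where and Series.mask are mirror images of each other: where keeps values where the condition is True, mask replaces them there. A small sketch on a made-up Series (the sample values and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series(['Above Average', 'Below Average', 'Average'])
cond = s == 'Above Average'

# where: keep values where the condition is True, so the condition is negated here
with_where = s.where(~cond, 'Good')

# mask: replace values where the condition is True
with_mask = s.mask(cond, 'Good')

print(with_where.tolist())
print(with_mask.tolist())
```

Both calls return a new Series and leave `s` unchanged.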

Note, however, that np.where is typically the fastest approach:

m = df['col'] == 'Above Average'
df['new_column'] = np.where(m, 'Good', df['col'])

df['new_column'] is the new column name. Where the mask m is True, 'Good' is assigned; otherwise the value from df['col'] is used.


+----+---------------+
|    | col           |
|----+---------------|
|  0 | Nan           |
|  1 | Above Average |
|  2 | 1.0           |
+----+---------------+
+----+---------------+--------------+
|    | col           | new_column   |
|----+---------------+--------------|
|  0 | Nan           | Nan          |
|  1 | Above Average | Good         |
|  2 | 1.0           | 1.0          |
+----+---------------+--------------+

Here I would also add a note about using a mask with df.loc:

m = df['col']=='Above Average'
print(m)
df.loc[m, 'new_column'] = 'Good'

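You may see that the result is the same, but note that df.loc only assigns to the rows where the mask m is True; rows where m is False keep whatever value new_column already holds.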
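A small sketch of why this matters, on made-up data (the sample values and the second frame `df2` are assumptions for illustration). If new_column does not exist when df.loc is called, the rows where m is False have nothing to keep and end up as NaN; copying df['col'] first gives them their original value:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({'col': ['Above Average', 'Below Average', 'Average']})
m = df['col'] == 'Above Average'

# new_column does not exist yet: rows where m is False are left as NaN
df.loc[m, 'new_column'] = 'Good'

# Copy the source column first, then overwrite only the masked rows
df2 = pd.DataFrame({'col': ['Above Average', 'Below Average', 'Average']})
df2['new_column'] = df2['col']
df2.loc[m, 'new_column'] = 'Good'

print(df)
print(df2)
```

By contrast, np.where(m, 'Good', df['col']) states the fallback value explicitly, so it does not depend on new_column existing beforehand.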