I have a DataFrame of salary ranges, as shown below:
import pandas as pd
df = pd.DataFrame(columns=['Salary'])
df.Salary = ['30,000-39,999', '5,000-7,499', '250,000-299,999', '4,000-4,999', '60,000-69,999', '10,000-14,999', '80,000-89,999', '$0-999', '2,000-2,999', '70,000-79,999', '90,000-99,999', '125,000-149,999', '$0-999', '$0-999', '40,000-49,999', '20,000-24,999', '125,000-149,999', '$0-999', '10,000-14,999', '15,000-19,999', '20,000-24,999', '100,000-124,999', '$0-999']
df
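I want to replace these salary-range strings with numbers, where 1 stands for $0-999, 2 stands for 1,000-1,999, and so on. Below is my code for doing this: I build a dictionary that maps the strings to numbers and use two loops, one over every row of the DataFrame and one over every key in the dictionary: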
salary_dict = {'$0-999':1, '1,000-1,999':2, '2,000-2,999':3, '3,000-3,999':4, '4,000-4,999':5,
'5,000-7,499':6, '7,500-9,999':7, '10,000-14,999':8, '15,000-19,999':9, '20,000-24,999':10,
'25,000-29,999':11, '30,000-39,999':12, '40,000-49,999':13, '50,000-59,999':14, '60,000-69,999':15,
'70,000-79,999':16, '80,000-89,999':17, '90,000-99,999':18, '100,000-124,999':19, '125,000-149,999':20,
'150,000-199,999':21, '200,000-249,999':22, '250,000-299,999':23, '300,000-500,000':24, '> $500,000':25}
for i in range(len(df)):
    for key in salary_dict:
        if df.Salary[i]==key:
            df.Salary[i] = salary_dict[key]
            break
df
This is fine for small DataFrames, but for larger (longer) DataFrames the code takes a long time to finish running. How can I optimize it?
Answer 0 (score: 1)
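Use the apply function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
apply calls any function you define on each element of a Series. Here, lambda x: salary_dict.get(x, x) maps each element of df['Salary'] to its equivalent value in the dictionary (see Python lambda expressions if the syntax is unfamiliar). The get method covers the case where a key is not in the dictionary: such values are returned unchanged.
df['Salary'] = df['Salary'].apply(lambda x: salary_dict.get(x, x))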
print(df)
Output:
Salary
0 12
1 6
2 23
3 5
4 15
5 8
6 17
7 1
8 3
9 16
10 18
11 20
12 1
13 1
14 13
15 10
16 20
17 1
18 8
19 9
20 10
21 19
22 1
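
As a possible alternative that is not part of the answer above, Series.map accepts a dictionary directly and performs the lookup without a lambda. The sketch below uses a shortened, hypothetical mapping and sample data purely for illustration. One difference from salary_dict.get(x, x): values with no matching key become NaN instead of being kept as-is, so it may be worth checking for them.

```python
import pandas as pd

# Shortened, hypothetical mapping and data for illustration only.
salary_dict = {'$0-999': 1, '1,000-1,999': 2, '2,000-2,999': 3}
df = pd.DataFrame({'Salary': ['2,000-2,999', '$0-999', '1,000-1,999', '$0-999']})

# Series.map looks up every value in the dictionary in one call.
codes = df['Salary'].map(salary_dict)

# Values without a matching key come back as NaN; flag them before overwriting.
unmatched = codes.isna()
df['Salary'] = codes

# A value that is not a key in the dictionary maps to NaN.
missing = pd.Series(['unknown']).map(salary_dict)

print(df)
print(missing)
```

If unmatched contains any True entries, the corresponding strings were not keys in the dictionary and would need separate handling.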