I have two columns in my dataset, PREVAILING_WAGE and JOB_TITLE.
JOB_TITLE:
ANALYST, BRAND DEVELOPMENT
ANESTHESIOLOGIST
ANESTHESIOLOGIST
BUSINESS INTELLIGENCE ANALYSTS
CIVIL ENGINEER
CIVIL ENGINEER
COMPUTER PROGRAMMER
COMPUTER PROGRAMMER ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEMS ANAGLYST
COMPUTER SYSTEMS ANALYST
CONSULTANT
CORPORATE COMMUNICATIONS SPECIALIST
COUNSELOR
DESIGN
ELEMENTARY CO-TEACHER
FASHION MODEL
FIELD ENGINEER
FINANCIAL ANALYST
FINANCIAL SENIOR ANALYST
FINANCIAL SPECIALIST
These values correspond to NaN values in the PREVAILING_WAGE column. My data is typically of size (700,000 x 2).
df2 = df[df.PREVAILING_WAGE.isnull()]
df3 = df2.sort_values(by='JOB_TITLE',ascending=True)
print(df3.JOB_TITLE)
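I want to fill in the wage (PREVAILING_WAGE) column for these JOB_TITLE values.
I want to find the mean salary for each job title and then assign it to the rows of that job title whose wage is missing.
For example, if the mean salary for COMPUTER PROGRAMMER is 90k, then every COMPUTER PROGRAMMER row without wage information should get 90k.
I saw a similar question at the following link, but it does not contain the information I need.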
Answer 0 (score: 1)
First, I create some random data with NaN values so that I can test the code.
job_title = '''ANALYST, BRAND DEVELOPMENT
ANESTHESIOLOGIST
ANESTHESIOLOGIST
BUSINESS INTELLIGENCE ANALYSTS
CIVIL ENGINEER
CIVIL ENGINEER
COMPUTER PROGRAMMER
COMPUTER PROGRAMMER ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEMS ANAGLYST
COMPUTER SYSTEMS ANALYST
CONSULTANT
CORPORATE COMMUNICATIONS SPECIALIST
COUNSELOR
DESIGN
ELEMENTARY CO-TEACHER
FASHION MODEL
FIELD ENGINEER
FINANCIAL ANALYST
FINANCIAL SENIOR ANALYST
FINANCIAL SPECIALIST'''.split('\n')
job_title = list(set(job_title))
# --- create random data with some NaN
import random
data = []
# rows with `NaN`
for _ in range(1):
    for item in job_title:
        data.append((item, None))
# rows with random SALARY
for _ in range(2):
    for item in job_title:
        data.append((item, random.randint(10000, 100000)))
# put all in random order
random.shuffle(data)
import pandas as pd
df = pd.DataFrame(data, columns=['JOB_TITLE', 'SALARY'])
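Now I have a DataFrame with random data and NaN values, so I can build the solution.
This line creates a boolean mask that selects only the rows with NaN. I will use it to look at those rows before and after the fill.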
rows_with_na = df['SALARY'].isna()
I can look at these rows before the fill:
print('\n--- before ---\n')
print(df[ rows_with_na ])
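As a small sketch of why the mask is stored in a variable: it is computed once, before anything is filled, so it keeps pointing at the originally empty rows even after they receive a value. The frame `toy` and the placeholder value 12345.0 are invented for this illustration and are not part of the data above.

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({
    'JOB_TITLE': ['CIVIL ENGINEER', 'CIVIL ENGINEER', 'CONSULTANT'],
    'SALARY': [50000.0, None, 70000.0],
})

# Boolean Series: True where SALARY is missing.
mask = toy['SALARY'].isna()
print(toy[mask])

# Fill the gap by hand with a placeholder value, then reuse the same mask.
toy.loc[mask, 'SALARY'] = 12345.0

still_selected = toy[mask]                # same row, now with a value
recomputed = toy[toy['SALARY'].isna()]    # empty: nothing is missing any more
print(still_selected)
print(recomputed)
```

Recomputing `isna()` after the fill would select nothing, which is why the saved mask is the one to use for a before/after comparison.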
I tried groupby on JOB_TITLE, computed mean() for each group and used fillna() to fill the NaN values in the group, but this did not change the original df.
print('\n--- mean ---\n')
groups = df.groupby(['JOB_TITLE'])
for idx, grp in groups:
    mean = grp['SALARY'].mean()
    print('mean:', mean, idx)
    print(grp['SALARY'].fillna(mean))  # doesn't work as I expected: returns a new Series, df is unchanged
    print('---')
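The loop leaves df untouched because `fillna` returns a new Series and the result is only printed, never stored. A minimal sketch of one way to keep the loop and still update the frame is to collect the filled pieces and assign them back, relying on index alignment. The frame `toy` and its values are made up for this illustration.

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({
    'JOB_TITLE': ['CIVIL ENGINEER', 'CONSULTANT', 'CIVIL ENGINEER', 'CONSULTANT'],
    'SALARY': [50000.0, 70000.0, None, None],
})

pieces = []
for title, grp in toy.groupby('JOB_TITLE'):
    # fillna builds a new Series; toy is not touched here.
    pieces.append(grp['SALARY'].fillna(grp['SALARY'].mean()))

missing_after_loop = int(toy['SALARY'].isna().sum())

# Assigning the concatenated pieces aligns them with toy by index label.
toy['SALARY'] = pd.concat(pieces)
missing_after_assign = int(toy['SALARY'].isna().sum())

print(missing_after_loop, missing_after_assign)
print(toy)
```

This works, but it is essentially a hand-written version of what `transform` does in one call.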
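However, using the groups with transform, and calling fillna with the group mean inside it, I can get the changes into df.
After that I can look at the same rows again, now filled.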
groups = df.groupby(['JOB_TITLE'])
#df['SALARY'] = groups.transform(lambda x: x.fillna(x.mean()))
#df['SALARY'] = groups.transform(lambda x: x.fillna(x.mean()))['SALARY']
df['SALARY'] = groups['SALARY'].transform(lambda x: x.fillna(x.mean()))
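A self-contained sketch of the same idea, written without a lambda: `transform('mean')` returns one group mean per row, aligned with the frame, and `fillna` accepts that Series directly. The frame `toy` and its values are invented for this illustration. The DESIGN row is there to show a case the approach cannot fill, because a job title with no known salary has a NaN mean.

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({
    'JOB_TITLE': ['CIVIL ENGINEER', 'CIVIL ENGINEER', 'CIVIL ENGINEER',
                  'CONSULTANT', 'CONSULTANT', 'DESIGN'],
    'SALARY': [40000.0, 60000.0, None, 70000.0, None, None],
})

# Remember which rows were empty before filling.
rows_with_na = toy['SALARY'].isna()

# One mean per row, computed per JOB_TITLE; NaN values are skipped by mean.
group_mean = toy.groupby('JOB_TITLE')['SALARY'].transform('mean')
toy['SALARY'] = toy['SALARY'].fillna(group_mean)

print('\n--- after ---\n')
print(toy[rows_with_na])

# Job titles that are still empty because the whole group had no salary.
still_missing = toy.loc[toy['SALARY'].isna(), 'JOB_TITLE'].tolist()
print(still_missing)
```

With the asker's data the column would be PREVAILING_WAGE instead of SALARY. How to treat titles that remain empty (for example, falling back to the overall mean) is a separate decision that the question does not specify.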