Filling specific missing values in Python

Asked: 2019-04-16 00:44:32

Tags: python pandas scikit-learn

I have two columns in my dataset, PREVAILING_WAGE and JOB_TITLE.

JOB_TITLE

ANALYST, BRAND DEVELOPMENT
ANESTHESIOLOGIST
ANESTHESIOLOGIST
BUSINESS INTELLIGENCE ANALYSTS
CIVIL ENGINEER
CIVIL ENGINEER
COMPUTER PROGRAMMER
COMPUTER PROGRAMMER ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEMS ANAGLYST
COMPUTER SYSTEMS ANALYST
CONSULTANT
CORPORATE COMMUNICATIONS SPECIALIST
COUNSELOR
DESIGN
ELEMENTARY CO-TEACHER
FASHION MODEL
FIELD ENGINEER
FINANCIAL ANALYST
FINANCIAL SENIOR ANALYST
FINANCIAL SPECIALIST

These job titles are the ones whose PREVAILING_WAGE value is NaN. Overall, my data has shape (700,000 x 2). I listed them with:

df2 = df[df.PREVAILING_WAGE.isnull()]
df3 = df2.sort_values(by='JOB_TITLE',ascending=True)
print(df3.JOB_TITLE)

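I want to fill in the wage (PREVAILING_WAGE) column for these JOB_TITLE values.

I want to find the mean salary for each job title and then assign it to the rows of that job title where the wage is missing.

For example, if the mean salary for COMPUTER PROGRAMMER is 90k, then every COMPUTER PROGRAMMER row with no wage information should get 90k.

I saw a similar question at the link below, but it does not contain the information I need:

Filling Missing values Pandas Dataframe by specific value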

1 Answer:

Answer 0: (score: 1)

First, I create some random data with NaN values so that I can test the code.

job_title = '''ANALYST, BRAND DEVELOPMENT
ANESTHESIOLOGIST
ANESTHESIOLOGIST
BUSINESS INTELLIGENCE ANALYSTS
CIVIL ENGINEER
CIVIL ENGINEER
COMPUTER PROGRAMMER
COMPUTER PROGRAMMER ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEM ANALYST
COMPUTER SYSTEMS ANAGLYST
COMPUTER SYSTEMS ANALYST
CONSULTANT
CORPORATE COMMUNICATIONS SPECIALIST
COUNSELOR
DESIGN
ELEMENTARY CO-TEACHER
FASHION MODEL
FIELD ENGINEER
FINANCIAL ANALYST
FINANCIAL SENIOR ANALYST
FINANCIAL SPECIALIST'''.split('\n')

job_title = list(set(job_title))

# --- create random data with some NaN
import random

data = []

# rows with `NaN`
for _ in range(1):
    for item in job_title:
        data.append( (item, None))

# rows with random SALARY
for _ in range(2):    
    for item in job_title:
        data.append( (item, random.randint(10000,100000)))    

# put all in random order
random.shuffle(data)

import pandas as pd

df = pd.DataFrame(data, columns=['JOB_TITLE', 'SALARY'])

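Now I have a DataFrame with random data and NaN values, so I can build the solution.

This line creates a boolean mask that selects only the rows with NaN. I will use it to look at those rows before and after the fill.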

rows_with_na = df['SALARY'].isna()

I can see these rows before the fill:

print('\n--- before ---\n')
print(df[ rows_with_na ])

I tried grouping by JOB_TITLE, taking mean() of each group and calling fillna() on the group to replace the NaN values, but this does not change the original df:

print('\n--- mean ---\n')

groups = df.groupby(['JOB_TITLE'])

for idx, grp in groups:
    mean = grp['SALARY'].mean()
    print('mean:', mean, idx)
    print(grp['SALARY'].fillna(mean)) # doesn't work as I expected
    print('---')
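
A short sketch of why the loop above leaves df unchanged: fillna() returns a new Series by default, and here the result is only printed. The toy values below are hypothetical and are not taken from the question's dataset. Collecting the filled pieces and assigning them back is one possible way to make the loop approach modify df.

```python
import pandas as pd

# Hypothetical toy data, small enough to check by hand
df = pd.DataFrame({
    'JOB_TITLE': ['CIVIL ENGINEER', 'CIVIL ENGINEER', 'CONSULTANT', 'CONSULTANT'],
    'SALARY': [50000.0, None, 80000.0, None],
})

pieces = []
for title, grp in df.groupby('JOB_TITLE'):
    # fillna() builds a new Series; neither grp nor df is modified here
    pieces.append(grp['SALARY'].fillna(grp['SALARY'].mean()))

# df still contains its two NaN values at this point
missing_after_loop = int(df['SALARY'].isna().sum())

# Assigning the concatenated pieces back writes the values into df;
# pandas aligns them with df on the index labels
df['SALARY'] = pd.concat(pieces)

print(missing_after_loop)
print(df)
```
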

However, using the groups with transform, together with fillna and mean, I can get the changes into df.

After running the following, the rows selected by rows_with_na contain the mean salary of their job title:

groups = df.groupby(['JOB_TITLE'])

#df['SALARY'] = groups.transform(lambda x: x.fillna(x.mean()))
#df['SALARY'] = groups.transform(lambda x: x.fillna(x.mean()))['SALARY']
df['SALARY'] = groups['SALARY'].transform(lambda x: x.fillna(x.mean()))
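
As a possible variation, the same fill can be written without a lambda by passing the string 'mean' to transform and handing the result to fillna. The sketch below uses hypothetical toy data. It also shows one case the per-group mean cannot handle: a job title with no known salary at all keeps its NaN. Falling back to the overall mean for such rows is an assumption made for this example, not something the question asks for.

```python
import pandas as pd

# Hypothetical toy data: DESIGN has no known salary at all
df = pd.DataFrame({
    'JOB_TITLE': ['COMPUTER PROGRAMMER', 'COMPUTER PROGRAMMER', 'COMPUTER PROGRAMMER',
                  'CONSULTANT', 'CONSULTANT', 'DESIGN'],
    'SALARY': [80000.0, 100000.0, None, 60000.0, None, None],
})

# Mean of the known salaries, computed before anything is filled
overall_mean = df['SALARY'].mean()

# Per-row mean of the row's job title (NaN values are skipped by mean)
group_mean = df.groupby('JOB_TITLE')['SALARY'].transform('mean')
df['SALARY'] = df['SALARY'].fillna(group_mean)

# A title whose salaries are all NaN has a NaN group mean, so it stays missing
still_missing = int(df['SALARY'].isna().sum())

# Assumed fallback: use the overall mean for those remaining rows
df['SALARY'] = df['SALARY'].fillna(overall_mean)

print(still_missing)
print(df)
```

With these values the missing COMPUTER PROGRAMMER row receives 90000, which matches the example in the question.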