Question

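I have two columns in a DataFrame:

1) Work experience (years)

2) company_type

I want to impute the company_type column based on the work experience column. The company_type column has NaN values that should be filled in according to work experience. The work experience column has no missing values.

Here Work_exp is numeric data and company_type is categorical data.

Sample data: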

Work_exp      company_type
   10            PvtLtd
   0.5           startup
   6           Public Sector
   8               NaN
   1             startup
   9              PvtLtd
   4               NaN
   3           Public Sector
   2             startup
   0               NaN
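
For anyone who wants to reproduce this, here is a minimal sketch that builds the sample above as a DataFrame. The variable names `df` and `missing` are assumptions for illustration; the values are copied from the table.

```python
import numpy as np
import pandas as pd

# Sample data from the question; np.nan marks the missing company_type values.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

# Boolean mask of the rows that still need a company_type
missing = df["company_type"].isna()
print(df[missing])
```

The mask selects the three rows (index 3, 6 and 9) whose company_type has to be imputed.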

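I have decided on the thresholds for imputing the NaN values:

Startup if work_exp < 2 yrs
Public sector if work_exp > 2 yrs and < 8 yrs
PvtLtd if work_exp > 8 yrs

Based on the threshold criteria above, how can I impute the missing categorical values in the company_type column?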

Answer 1

You can use numpy.select together with numpy.where:

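import numpy as np
import pandas as pd

# define conditions and values (np.select takes the first condition that matches)
conditions = [df['Work_exp'] < 2, df['Work_exp'].between(2, 8), df['Work_exp'] > 8]
values = ['Startup', 'PublicSector', 'PvtLtd']

# apply logic where company_type is null;
# default=None avoids mixing np.select's integer default (0) with string choices,
# which newer NumPy versions may reject
df['company_type'] = np.where(df['company_type'].isnull(),
                              np.select(conditions, values, default=None),
                              df['company_type'])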

print(df)

   Work_exp  company_type
0      10.0        PvtLtd
1       0.5       startup
2       6.0  PublicSector
3       8.0  PublicSector
4       1.0       startup
5       9.0        PvtLtd
6       4.0  PublicSector
7       3.0  PublicSector
8       2.0       startup
9       0.0       Startup
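
Index 3 (Work_exp 8) lands in PublicSector because np.select returns the value for the first condition that is true. A small sketch of that behaviour, using a made-up array `exp` and a hypothetical "Unknown" fallback label that is not part of the answer:

```python
import numpy as np

# Hypothetical experience values chosen to hit the boundaries
exp = np.array([1.0, 2.0, 8.0, 9.0])

# The second and third conditions deliberately overlap at 8
conditions = [exp < 2, (exp >= 2) & (exp <= 8), exp >= 8]
choices = ["Startup", "PublicSector", "PvtLtd"]

# The first matching condition wins; "Unknown" is used only if none match
labels = np.select(conditions, choices, default="Unknown")
print(labels.tolist())
```

Because the order of the conditions decides ties, it is worth making the boundaries explicit rather than relying on which condition happens to come first.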

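pd.Series.between includes both the start and end values by default, and it works for comparing float values. To exclude the boundaries, pass inclusive="neither" (older pandas versions accepted inclusive=False for this).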

s = pd.Series([2, 2.5, 4, 4.5, 5])

s.between(2, 4.5)

0     True
1     True
2     True
3     True
4    False
dtype: bool
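
As a rough illustration of the boundary options, the sketch below compares the default with two other settings. It assumes a pandas version that accepts the string values of `inclusive` ("both", "neither", "left", "right"); the variable names are made up.

```python
import pandas as pd

s = pd.Series([2, 2.5, 4, 4.5, 5])

# Default: both endpoints count as inside the range (2 <= x <= 4.5)
both = s.between(2, 4.5)

# Exclude both endpoints (2 < x < 4.5)
neither = s.between(2, 4.5, inclusive="neither")

# Include only the left endpoint (2 <= x < 4.5)
left = s.between(2, 4.5, inclusive="left")

print(pd.DataFrame({"value": s, "both": both, "neither": neither, "left": left}))
```

The "left" variant matches half-open ranges such as "at least 2 and under 8", which avoids a value sitting on a boundary matching two ranges.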

Answer 2

Great answer by @jpp. I just wanted to add another approach here using pandas.cut().

df['company_type'] = pd.cut(
    df.Work_exp,
    bins=[0,2,8,100],
    right=False,
    labels=['Startup', 'Public', 'Private']
)



   Work_exp company_type
0   10.0    Private
1   0.5     Startup
2   6.0     Public
3   8.0     Private
4   1.0     Startup
5   9.0     Private
6   4.0     Public
7   3.0     Public
8   2.0     Public
9   0.0     Startup

Also, going by your conditions, shouldn't index 8 be Public?

Startup < 2
PublicSector >=2 and < 8
PvtLtd >= 8
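
Note that the pd.cut call above overwrites the whole column. If only the NaN rows should change, one possible variation is to bin first and then fill. This is a sketch under assumptions: the labels reuse the spellings from the question's data, the boundaries follow the three conditions listed just above, and np.inf replaces the upper bin edge of 100.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

# Left-closed bins: [0, 2) -> startup, [2, 8) -> Public Sector, [8, inf) -> PvtLtd
binned = pd.cut(
    df["Work_exp"],
    bins=[0, 2, 8, np.inf],
    right=False,
    labels=["startup", "Public Sector", "PvtLtd"],
)

# Convert the categorical result to plain objects, then use it only where
# company_type is missing; existing values are left untouched
df["company_type"] = df["company_type"].fillna(binned.astype(object))
print(df)
```

With this version index 8 keeps its original value "startup", and only indexes 3, 6 and 9 are filled from the bins.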
