Data goes missing when creating bins with the python pandas cut function

Asked: 2019-01-27 11:50:38

Tags: python pandas dataframe

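My goal is to move a column from df1 to df2 and bin it at the same time. I have a dataframe named df1 containing 3 numeric variables. I want to take the variable 'tenure' into df2 and create bins from it. The column values are transferred to df2, but df2 shows some missing values. Please find the code below: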

df2=pd.cut(df1["tenure"] , bins=[0,20,60,80], labels=['low','medium','high'])

Before creating df2, I checked df1 for missing values and found none, but after creating the bins df2 shows 11 missing values.

print(df2.isnull().sum())

The code above reports 11 missing values.

Any help is appreciated.

1 Answer:

Answer 0: (score: 1)

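I assume some values in df1['tenure'] fall outside (0, 80], perhaps zeros. pd.cut returns NaN for any value that does not fall into one of the bins. See the example below: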

import pandas as pd
df1 = pd.DataFrame({'tenure': [-1, 0, 12, 34, 78, 80, 85]})
print(pd.cut(df1["tenure"], bins=[0, 20, 60, 80], labels=['low', 'medium', 'high']))

0       NaN    # -1 is lower than 0 so result is null
1       NaN    # it was 0 but the segment is open on the lowest bound so 0 gives null
2       low
3    medium
4      high
5      high    # 80 is kept as the segment is closed on the right
6       NaN    # 85 is higher than 80 so result is null
Name: tenure, dtype: category
Categories (3, object): [low < medium < high]
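
To see which bounds are open and which are closed, it can help to call pd.cut without labels, so that the categories are the intervals themselves. The following is only a minimal sketch on the same sample data; the names tenure and binned are illustrative and not part of the original question.

```python
import pandas as pd

# Same sample values as in the example above (illustrative data only).
tenure = pd.Series([-1, 0, 12, 34, 78, 80, 85], name="tenure")

# Without labels, the categories are the intervals themselves.
binned = pd.cut(tenure, bins=[0, 20, 60, 80])

# The intervals are closed on the right by default (right=True),
# so 0 is outside the first bin while 80 is inside the last one.
print(binned.cat.categories)
print(binned.isnull().sum())  # values that fell outside every interval
```

Each interval excludes its left edge and includes its right edge, which is why 0 maps to NaN while 80 maps to the last bin.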

Now, you can pass the parameter include_lowest=True to pd.cut so that the left bound of the first bin is included in the result:

print (pd.cut(df1["tenure"] , bins=[0,20,60,80], labels=['low','medium','high'],
              include_lowest=True))

0       NaN
1       low  # now where the value was 0 you get low and not null
2       low
3    medium
4      high
5      high
6       NaN
Name: tenure, dtype: category
Categories (3, object): [low < medium < high]

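So in the end, I think that if you print len(df1[(df1.tenure <= 0) | (df1.tenure > 80)]) with your data, you will get 11, the number of null values in df2 (with the sample data here it is 3).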
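
The check above can be sketched as a small script that builds the out-of-range mask and compares it with the NaN positions produced by pd.cut. This is a sketch on the sample data from this answer, assuming the default right-closed bins without include_lowest; the names bins, binned and out_of_range are illustrative.

```python
import pandas as pd

# Sample data from the answer; replace with your own df1.
df1 = pd.DataFrame({"tenure": [-1, 0, 12, 34, 78, 80, 85]})
bins = [0, 20, 60, 80]

binned = pd.cut(df1["tenure"], bins=bins, labels=["low", "medium", "high"])

# With the default right-closed bins, a value is unbinned when it is
# <= the first edge or > the last edge.
out_of_range = (df1["tenure"] <= bins[0]) | (df1["tenure"] > bins[-1])

print(out_of_range.sum())                        # 3 for the sample data
print(df1.loc[out_of_range, "tenure"].tolist())  # [-1, 0, 85]

# The mask should line up exactly with the NaN positions in the result.
print((out_of_range == binned.isnull()).all())   # True
```

Listing the offending values this way shows whether the missing entries come from zeros, from negatives, or from values above the last bin edge, which in turn suggests whether include_lowest=True or wider bin edges is the appropriate fix.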