Question

I have a DataFrame:

    Name    Hours_Worked
1   James   3
2   Sam     2.5
3   Billy   T
4   Sarah   A
5   Felix   5

First, how do I count the rows that contain non-numeric values?

Second, how do I filter to identify the rows that contain non-numeric values?

Answer 1

Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then create a mask with isna:

mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isna()
# older pandas versions
#mask = pd.to_numeric(df['Hours_Worked'], errors='coerce').isnull()

Then count the True values with sum:

a = mask.sum()
print (a)
2

And filter with boolean indexing:

df1 = df[mask]
print (df1)
    Name Hours_Worked
3  Billy            T
4  Sarah            A

Details:

print (mask)
1    False
2    False
3     True
4     True
5    False
Name: Hours_Worked, dtype: bool
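
For anyone who wants to run the steps above end to end, here is a minimal sketch. It assumes the Hours_Worked column holds strings (object dtype) and rebuilds the question's data by hand; the variable names non_numeric_count, non_numeric_rows and numeric_rows are only illustrative.

```python
import pandas as pd

# Rebuild the DataFrame from the question (assumption: values are stored as strings,
# and the index starts at 1 as shown in the question).
df = pd.DataFrame(
    {
        "Name": ["James", "Sam", "Billy", "Sarah", "Felix"],
        "Hours_Worked": ["3", "2.5", "T", "A", "5"],
    },
    index=range(1, 6),
)

# Values that cannot be parsed as numbers become NaN, so isna() flags them.
mask = pd.to_numeric(df["Hours_Worked"], errors="coerce").isna()

non_numeric_count = int(mask.sum())  # each True counts as 1
non_numeric_rows = df[mask]          # boolean indexing keeps the flagged rows
numeric_rows = df[~mask]             # the inverted mask keeps the parsable rows

print(non_numeric_count)
print(non_numeric_rows)
```

Note that this mask also flags values that were already missing (NaN) in the column, since they stay NaN after the conversion.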

Another way to check for numeric values:

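def check_num(x):
    # Return True when x cannot be converted to a float (i.e. it is non-numeric)
    try:
        float(x)
        return False
    except (ValueError, TypeError):
        # TypeError covers values such as None, which float() rejects
        return True

mask = df['Hours_Worked'].apply(check_num)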
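
As a rough check, the apply-based approach can be compared with the to_numeric mask on the question's data. This is only a sketch: is_non_numeric is an illustrative name for the same idea as check_num, and the column is assumed to hold strings.

```python
import pandas as pd

def is_non_numeric(value):
    """Return True when float() cannot parse the value."""
    try:
        float(value)
        return False
    except (ValueError, TypeError):
        return True

# The question's column, assumed to be stored as strings.
hours = pd.Series(["3", "2.5", "T", "A", "5"], name="Hours_Worked")

apply_mask = hours.apply(is_non_numeric)
coerce_mask = pd.to_numeric(hours, errors="coerce").isna()

# On this data both masks flag the same rows.
same_result = apply_mask.tolist() == coerce_mask.tolist()
print(same_result)

# Edge case: float() accepts the literal string "nan", so the apply version
# treats it as numeric, while the coerce-based mask flags it as NaN.
nan_text = pd.Series(["nan"])
apply_flags_nan = bool(nan_text.apply(is_non_numeric).iloc[0])
coerce_flags_nan = bool(pd.to_numeric(nan_text, errors="coerce").isna().iloc[0])
```

The two approaches agree on ordinary values, but they differ on strings such as "nan", so it is worth picking the one whose behaviour matches what you consider "numeric".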

Answer 2

In the end, I did this to evaluate the strings in my numeric column:

df['Hr_String'] = pd.to_numeric(df['Hours_Worked'], errors='coerce')

I wanted the result in a new column so that I could filter on it, which was more helpful to me:

df[df['Hr_String'].isnull()]

It returns:

    Name    Hours_Worked    Hr_String
2   Billy   T               NaN
3   Sarah   A               NaN

Then I did:

df['Hr_String'].isnull().sum()

It returns:

2

Then I wanted the percentage of total rows (expressed here as a fraction), so I did this:

df['Hr_String'].isnull().sum() / df.shape[0]

It returns:

0.4
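
As a side note, the same fraction can be obtained by taking the mean of the boolean mask, because True counts as 1 and False as 0. A small sketch, assuming the question's data stored as strings; is_bad, fraction_bad and percent_bad are illustrative names:

```python
import pandas as pd

# The question's data (assumption: Hours_Worked is stored as strings).
df = pd.DataFrame(
    {
        "Name": ["James", "Sam", "Billy", "Sarah", "Felix"],
        "Hours_Worked": ["3", "2.5", "T", "A", "5"],
    }
)

# True where the value could not be parsed as a number.
is_bad = pd.to_numeric(df["Hours_Worked"], errors="coerce").isna()

# The mean of a boolean Series is the share of True values.
fraction_bad = is_bad.mean()
percent_bad = 100 * fraction_bad

print(f"{percent_bad:.0f}% of rows are non-numeric")
```

This avoids the explicit division by df.shape[0] and gives the same number as sum() divided by the row count.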

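Overall, this approach worked for me. It helped me see which string values were messing up my numeric column and let me look at the percentage. If it is really small, I can drop those rows for analysis. If the percentage is large, I will have to figure out whether I can impute the values or handle them some other way.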
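
The drop-or-impute decision described above could be sketched roughly like this. The cutoff MAX_DROP_FRACTION and the choice of the median as the fill value are assumptions made purely for illustration, not a recommendation; pick whatever suits your analysis.

```python
import pandas as pd

# The question's data (assumption: Hours_Worked is stored as strings).
df = pd.DataFrame(
    {
        "Name": ["James", "Sam", "Billy", "Sarah", "Felix"],
        "Hours_Worked": ["3", "2.5", "T", "A", "5"],
    }
)
df["Hr_String"] = pd.to_numeric(df["Hours_Worked"], errors="coerce")

MAX_DROP_FRACTION = 0.05  # hypothetical cutoff, chosen only for this example
fraction_bad = df["Hr_String"].isna().mean()

# Option 1: drop the rows whose value could not be parsed.
dropped = df.dropna(subset=["Hr_String"])

# Option 2: impute the unparsable values, here with the column median
# (an assumption; any other imputation strategy could be swapped in).
imputed = df.assign(Hr_String=df["Hr_String"].fillna(df["Hr_String"].median()))

# Drop when few rows are affected, otherwise fall back to imputation.
result = dropped if fraction_bad <= MAX_DROP_FRACTION else imputed
print(result)
```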

Counting string values in a numeric column in pandas
