Question

I have a large DataFrame in pandas that, apart from the column used as the index, is supposed to contain only numeric values:

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

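How can I find the row(s) of the DataFrame df that contain a non-numeric value?

In this example it is the fourth row of the DataFrame, which has the string 'bad' in column a. How can I find this row programmatically?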

Answer 1

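You can use np.isreal to check the type of each element (applymap applies a function to each element of the DataFrame; since pandas 2.1 the same method is called DataFrame.map and applymap is deprecated):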

In [11]: df.applymap(np.isreal)
Out[11]:
          a     b
item
a      True  True
b      True  True
c      True  True
d     False  True
e      True  True

If all the values in a row are True, then they are all numeric:

In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a        True
b        True
c        True
d       False
e        True
dtype: bool

So, to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows that have at least one non-numeric rogue):

In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
        a    b
item
d     bad  0.4

You can also find the location of the first offender using argmin:

In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
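
The transcript above comes from an older pandas release. In recent versions, np.argmin on a Series may return the integer position rather than the index label, so it can be clearer to ask for each explicitly. A minimal sketch, where row_ok is a hand-written stand-in for the boolean Series computed above:

```python
import pandas as pd

# Stand-in for df.applymap(np.isreal).all(1): True where the whole row is numeric.
row_ok = pd.Series([True, True, True, False, True],
                   index=pd.Index(['a', 'b', 'c', 'd', 'e'], name='item'))

# Index label of the first False value (False sorts below True).
first_bad_label = row_ok.idxmin()

# Integer position of the first False value.
first_bad_pos = int(row_ok.to_numpy().argmin())

# Caveat: if every row is valid, both calls still return the first row,
# so check row_ok.all() before trusting the result.
has_offender = not row_ok.all()
```

Checking has_offender first avoids reporting the first row as an offender when the data is actually clean.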

As @CTZhu pointed out, it may be slightly faster to check whether each element is an instance of int or float (np.isreal has some additional overhead):

df.applymap(lambda x: isinstance(x, (int, float)))
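
One possible variation on the isinstance check, sketched under a few assumptions: the helper name is_number is made up for illustration, numbers.Number is used so that NumPy scalar types are accepted too, and bool is excluded because it is a subclass of int. It goes through apply plus Series.map, which should behave the same regardless of whether applymap is available in the installed pandas version.

```python
import numbers
import pandas as pd

def is_number(x):
    """Hypothetical helper: True for int/float-like values, False for bool and str.

    Note that NaN is a float, so it counts as a number here.
    """
    return isinstance(x, numbers.Number) and not isinstance(x, bool)

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# Element-wise check, column by column.
mask = df.apply(lambda col: col.map(is_number))

# Rows with at least one non-numeric element.
bad_rows = df[~mask.all(axis=1)]
```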

Answer 2

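There are already some great answers to this question, but here is a handy snippet that I use regularly to drop rows that have non-numeric values in certain columns: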

# Eliminate invalid data from dataframe (see Example below for more context)

num_df = (df.drop(data_columns, axis=1)
         .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

num_df = num_df[num_df[data_columns].notnull().all(axis=1)]

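The way this works is that we first drop all the data_columns from df, and then use a join to put them back in after passing them through pd.to_numeric (with the option 'coerce', so that all non-numeric entries are converted to NaN). The result is saved to num_df.

On the second line, we use a filter that keeps only the rows in which all values are non-null.

Note that pd.to_numeric coerces to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example, '1.25' will be recognized as the numeric value 1.25.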
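
To make the coercion behaviour concrete, here is a small sketch on a made-up Series (the sample values are assumptions chosen for illustration, not data from the question):

```python
import pandas as pd

# Mixed input: a real number, two numeric-looking strings, and one bad string.
s = pd.Series([1, '1.25', '1e3', 'bad'], dtype=object)

# Strings that parse as numbers are converted; everything else becomes NaN.
coerced = pd.to_numeric(s, errors='coerce')

# Only the entry that failed to parse is flagged.
failed = coerced.isna()
```

Here '1.25' and '1e3' are kept as 1.25 and 1000.0, and only 'bad' ends up as NaN.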

Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0

Example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
   ...:                    "a": [1,2,3,"bad",5],
   ...:                    "b":[0.1,0.2,0.3,0.4,0.5]})

In [3]: df
Out[3]:
     a    b item
0    1  0.1    a
1    2  0.2    b
2    3  0.3    c
3  bad  0.4    d
4    5  0.5    e

In [4]: data_columns = ['a', 'b']

In [5]: num_df = (df
   ...:           .drop(data_columns, axis=1)
   ...:           .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

In [6]: num_df
Out[6]:
  item   a    b
0    a   1  0.1
1    b   2  0.2
2    c   3  0.3
3    d NaN  0.4
4    e   5  0.5

In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]:
  item  a    b
0    a  1  0.1
1    b  2  0.2
2    c  3  0.3
4    e  5  0.5

Answer 3

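Sorry for the confusion, this should be the correct approach. Do you only want to capture 'bad' and not something like 'good', or just any non-numeric value?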

In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
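
DataFrame.convert_objects is no longer available in recent pandas releases. A rough equivalent of the snippet above, assuming pd.to_numeric with errors='coerce' is an acceptable substitute for convert_numeric=True:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# Coerce every column; values that cannot be parsed become NaN.
numeric = df.apply(pd.to_numeric, errors='coerce')

# Integer positions of the rows that contain at least one NaN after coercion.
bad_positions = np.where(numeric.isna().any(axis=1))[0]
```

As with the original, this reports positions rather than index labels; df.index[bad_positions] maps them back to labels.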

Answer 4

# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

Convert to numeric using 'coerce', which replaces bad values with NaN:

a = pd.to_numeric(df.a, errors='coerce')

Use isna to return a boolean index:

idx = a.isna()

Apply that index to the data frame:

df[idx]

Output

Returns the row(s) containing the bad data:

        a    b
item          
d     bad  0.4
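
The same idea can be extended to every column at once. The sketch below also separates values that were already missing from values that only became NaN through coercion; the names coerced, bad_mask, bad_rows and bad_cols are illustrative, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# Coerce all columns; unparseable values become NaN.
coerced = df.apply(pd.to_numeric, errors='coerce')

# A cell is "bad" if it is NaN after coercion but was not missing before.
bad_mask = coerced.isna() & df.notna()

bad_rows = df[bad_mask.any(axis=1)]   # rows with at least one bad cell
bad_cols = bad_mask.any(axis=0)       # which columns contain bad cells
```

Comparing against df.notna() keeps genuinely missing values from being reported as non-numeric.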

Answer 5

If you are working with a column of string values, you can use the very handy function Series.str.isnumeric(), like this:

a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])

What I did was copy that column to a new column, apply str.replace('.','') and str.replace(',',''), and then select the numeric values:

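# regex=False makes '.' a literal dot rather than a "match any character" pattern
a = a.str.replace('.', '', regex=False)
a = a.str.replace(',', '', regex=False)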
a.str.isnumeric()

Out[15]:
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool
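
A caveat worth illustrating: stripping separators and calling isnumeric rejects signed values such as '-5' and accepts malformed ones such as '1.2.3'. The sketch below compares it with pd.to_numeric on made-up sample values, assuming a comma is meant as a decimal separator:

```python
import pandas as pd

a = pd.Series(['hi', 'hola', '2.31', '288', '-5', '0,21', '1.2.3'])

# Approach from the answer: strip separators, then test for digits only.
by_isnumeric = (a.str.replace('.', '', regex=False)
                 .str.replace(',', '', regex=False)
                 .str.isnumeric())

# Alternative: normalise the decimal comma, then let pandas try to parse.
by_to_numeric = pd.to_numeric(a.str.replace(',', '.', regex=False),
                              errors='coerce').notna()
```

The two masks disagree on '-5' and '1.2.3', which is where the parsing-based check is usually closer to what is wanted.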

Good luck!

Answer 6

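I was thinking of something similar, so here is just an idea: convert the column to strings, since strings are easier to work with. However, this does not work for strings that contain digits, such as bad123. The ~ takes the complement of the selection.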

df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)

You can use '|'.join([str(i) for i in range(10)]) to generate '0|1|...|8|9'.

Or use the np.isreal() function, just like the most-voted answer:

df[~df['a'].apply(lambda x: np.isreal(x))]
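
To address the bad123 limitation, one option is to require that the whole string looks like a number instead of merely containing a digit. This is only a sketch: the pattern below is an assumption that covers plain integers and decimals (not exponents or thousands separators), and Series.str.fullmatch needs a reasonably recent pandas.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad123', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# Work on a string copy so the original column keeps its values and dtype.
s = df['a'].astype(str)

# "Contains no digit" misses 'bad123', because it does contain digits.
no_digit = ~s.str.contains(r'\d')

# "Whole string is an optionally signed integer or decimal" catches it.
looks_numeric = s.str.fullmatch(r'[+-]?(?:\d+\.?\d*|\.\d+)')

bad_rows = df[~looks_numeric]
```

Using a separate string copy also avoids the astype(str) / astype(object) round trip on df['a'].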
