Question

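I noticed that when the elements of a column in a Pandas DataFrame contain several numeric substrings separated by spaces, the isnumeric method returns False.

For example: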

row 1, column 1 has the following: 0002 0003 1289
row 2, column 1 has the following: 89060 324 123431132
row 3, column 1 has the following: 890GB 32A 34311TT
row 4, column 1 has the following: 82A 34311TT
row 5, column 1 has the following: 82A 34311TT 889 9999C

Rows 1 and 2 are clearly all numeric, yet isnumeric returns False for both of them.
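A minimal reproduction of the behavior described above. The column name 'a' is an assumption (the question does not name the column; the name is borrowed from Answer 1). It suggests that the space between substrings is what makes the whole-string check fail:

```python
import pandas as pd

# Sample data from the question; the column name 'a' is assumed
df = pd.DataFrame({'a': ['0002 0003 1289',
                         '89060 324 123431132',
                         '890GB 32A 34311TT',
                         '82A 34311TT',
                         '82A 34311TT 889 9999C']})

# Whole-string check: a space is not a numeric character,
# so every row comes back False, including rows 1 and 2
whole = df['a'].str.isnumeric()
print(whole.tolist())

# Checking each space-separated token on its own behaves as expected
tokens = [s.isnumeric() for s in '0002 0003 1289'.split()]
print(tokens)
```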

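I found a workaround: split each substring into its own column, create a boolean column for each substring, and then add the booleans together to show whether the row is entirely numeric. However, this is tedious and my function does not look tidy. I also don't want to strip and replace the whitespace (squeezing all the substrings into a single number), because I need to preserve the original substrings.

Does anyone know a simpler solution or technique that will correctly tell me that these elements with one or more numeric substrings are all numeric? My end goal is to delete these numeric-only rows.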

Answer 1

I think you need a list comprehension with split and all to check that every substring is numeric:

mask = ~df['a'].apply(lambda x: all([s.isnumeric() for s in x.split()]))

Or, as a plain list comprehension:

mask = [not all([s.isnumeric() for s in x.split()]) for x in df['a']]
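As a sketch of how the mask might be applied end to end, the example below wraps the same split-and-all check in a helper. The name is_all_numeric is hypothetical, and treating empty or missing values as "not numeric" is an assumption made here, not something the answer specifies:

```python
import pandas as pd

# Small sample, with an empty string and a missing value added for illustration
df = pd.DataFrame({'a': ['0002 0003 1289', '890GB 32A 34311TT', '', None]})

def is_all_numeric(value):
    """Hypothetical helper: True only for a string with at least one
    token where every space-separated token is numeric."""
    if not isinstance(value, str):
        return False  # assumption: missing values count as not numeric
    tokens = value.split()
    # all([]) is True, so require at least one token explicitly
    return bool(tokens) and all(t.isnumeric() for t in tokens)

# True for the rows to keep (those that are not numeric-only)
mask = ~df['a'].apply(is_all_numeric)
cleaned = df[mask]
print(cleaned)
```

Without the explicit token check, an empty or whitespace-only string would count as all numeric, because all() of an empty sequence is True, and the row would be dropped.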

To check instead whether at least one substring is numeric, use any:

mask = ~df['a'].apply(lambda x: any([s.isnumeric() for s in x.split()]))

Or, as a plain list comprehension:

mask = [not any([s.isnumeric() for s in x.split()]) for x in df['a']]

Answer 2

Here is one approach using pd.Series.map, any, and a generator expression with str.isdecimal and str.split.

import pandas as pd

df = pd.DataFrame({'col1': ['0002 0003 1289', '89060 324 123431132', '890GB 32A 34311TT',
                            '82A 34311TT', '82A 34311TT 889 9999C']})

df['numeric'] = df['col1'].map(lambda x: any(i.isdecimal() for i in x.split()))

Note that isdecimal is stricter than isdigit. In Python 2.7, however, you may need to use str.isdigit or str.isnumeric instead.
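To illustrate the strictness difference in Python 3, here is a small comparison of the three string methods. The sample characters are chosen for illustration only and do not come from the question's data:

```python
# Sample characters: plain digits, superscript two, vulgar fraction one half
samples = ['123', '\u00b2', '\u00bd']

results = {}
for s in samples:
    # Record (isdecimal, isdigit, isnumeric) for each sample
    results[s] = (s.isdecimal(), s.isdigit(), s.isnumeric())
    print(repr(s), results[s])
```

Plain digits pass all three checks, the superscript two passes isdigit and isnumeric but not isdecimal, and the fraction passes only isnumeric.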

To remove the rows where the result is False:

df = df[df['col1'].map(lambda x: any(i.isdecimal() for i in x.split()))]

Result

The first part of the logic:

                    col1 numeric
0         0002 0003 1289    True
1    89060 324 123431132    True
2      890GB 32A 34311TT   False
3            82A 34311TT   False
4  82A 34311TT 889 9999C    True
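The output above flags row 4 as True because any only requires one numeric substring. For the goal stated in the question (dropping rows that are entirely numeric), a possible variation is to swap any for all and negate the result. This is a sketch along the lines of Answer 1, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['0002 0003 1289', '89060 324 123431132', '890GB 32A 34311TT',
                            '82A 34311TT', '82A 34311TT 889 9999C']})

# At least one space-separated token is decimal
has_any = df['col1'].map(lambda x: any(i.isdecimal() for i in x.split()))

# Every space-separated token is decimal
all_numeric = df['col1'].map(lambda x: all(i.isdecimal() for i in x.split()))

# Keep only the rows that are not entirely numeric
remaining = df[~all_numeric]
print(remaining)
```

With the sample data, the two masks differ only on the last row, which mixes numeric and alphanumeric substrings.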

Remove rows of a column that have any numeric substring