Question

I have a CSV file with four columns. I read it like this:

df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])

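Field C is supposed to contain string values, but in some rows it holds non-string values (floats or other numbers). How can I drop those rows? I am using Pandas 0.18.1.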

Answer 1

Setup

import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print(df)

   A  B    C  D
0  a  b    c  d
1  e  f  1.2  g

Note that you can check the type of each individual cell.

print(type(df.loc[0, 'C']), type(df.loc[1, 'C']))

<class 'str'> <class 'float'>

Mask and slice

print(df.loc[df.C.apply(type) != float])

   A  B  C  D
0  a  b  c  d

More general

print(df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))])

   A  B  C  D
0  a  b  c  d
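
Since the question asks to keep only string values, the check can also be written positively. The following is a minimal sketch, assuming Python 3 (where the text type is str) and the same sample frame as above; the names is_str and only_strings are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))

# Build a boolean mask that is True where the cell in C is a Python str.
is_str = df['C'].apply(lambda x: isinstance(x, str))

# Keep only the rows selected by the mask.
only_strings = df.loc[is_str]
print(only_strings)
```

Unlike the float/int check, this also drops other non-string objects such as None or NaN, which may or may not be what you want.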

You can also call float on each value to see whether it can be interpreted as a float.

def try_float(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError, OverflowError):
        return False

print(df.loc[~df.C.apply(try_float)])

   A  B  C  D
0  a  b  c  d

The problem with this approach is that it also excludes strings that can be interpreted as floats.
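
To make that caveat concrete, here is a small sketch with a made-up column in which '3.14' is a genuine str. The sample values are assumptions chosen for illustration.

```python
import pandas as pd

def try_float(x):
    """Return True if float(x) succeeds, False otherwise."""
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# 'c' and '3.14' are both strings; 1.2 is a real float.
df = pd.DataFrame({'C': ['c', '3.14', 1.2]})

# Rows where float() fails are kept; '3.14' parses, so it is dropped too.
kept = df.loc[~df['C'].apply(try_float)]
print(kept)
```

Only the row holding 'c' survives, even though '3.14' is a string. Strings such as 'nan' and 'inf' are also accepted by float and would be dropped the same way.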

Timing comparison of the options I provided and jezrael's solution on a small dataframe:

For a dataframe with 500,000 rows:

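Checking whether the type is float appears to be the most performant, with the numeric check right behind it. If you need to check for both int and float, I'd go with jezrael's answer. If you can get away with checking only for float, use that one.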
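
If you want to reproduce the comparison yourself, a rough harness along these lines should do. It is a sketch rather than the benchmark used for this answer: the candidates dictionary, the repetition count, and the way the 500,000-row frame is built (by repeating the two sample rows) are all assumptions, and results will vary with machine and pandas version.

```python
import timeit
import pandas as pd

small = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
# Repeat the two sample rows to get a 500,000-row frame (an assumption).
big = pd.concat([small] * 250000, ignore_index=True)

# The three row filters discussed in this thread.
candidates = {
    'type != float': lambda d: d.loc[d['C'].apply(type) != float],
    'isinstance': lambda d: d.loc[d['C'].apply(lambda x: not isinstance(x, (float, int)))],
    'to_numeric': lambda d: d[pd.to_numeric(d['C'], errors='coerce').isnull()],
}

for name, func in candidates.items():
    # Average wall-clock time over three calls on the large frame.
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print('{}: {:.4f} s per call'.format(name, seconds))
```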

Answer 2

You can use boolean indexing with a mask created by to_numeric with the parameter errors='coerce': string values that cannot be parsed become NaN. Then check for them with isnull:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':['a',8,9],
                   'D':[1,3,5]})
print (df)
   A  B  C  D
0  1  4  a  1
1  2  5  8  3
2  3  6  9  5

print (pd.to_numeric(df.C, errors='coerce'))
0    NaN
1    8.0
2    9.0
Name: C, dtype: float64

print (pd.to_numeric(df.C, errors='coerce').isnull())
0     True
1    False
2    False
Name: C, dtype: bool

print (df[pd.to_numeric(df.C, errors='coerce').isnull()])
   A  B  C  D
0  1  4  a  1
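
One reason this approach suits the original question: when a small file is read with read_csv, a column that mixes text and numbers usually comes back with every cell as a string, so a type check cannot separate '1.2' from 'c'. The sketch below assumes a tiny in-memory stand-in for my.csv (the raw content is made up) and shows that to_numeric still finds the numeric-looking rows.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for my.csv.
raw = "a\tb\tc\td\ne\tf\t1.2\tg\nh\ti\t7\tj\n"
df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=['A', 'B', 'C', 'D'])

# Every cell in C was parsed as str, including '1.2' and '7'.
print(df['C'].map(type).unique())

# Values that fail numeric parsing become NaN; those are the rows to keep.
mask = pd.to_numeric(df['C'], errors='coerce').isnull()
print(df[mask])
```

Be aware that genuinely missing values in C are NaN as well, so they pass this mask; combine it with df['C'].notnull() if those rows should go too.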

Answer 3

Use the pandas.DataFrame.select_dtypes method. For example:

df.select_dtypes(exclude='object')
         or
df.select_dtypes(include=['int64','float','int'])
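
Keep in mind that select_dtypes filters whole columns by their dtype; it does not look at individual cells or remove rows. A short sketch using the sample frame from Answer 2 (the variable name numeric_cols is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': ['a', 8, 9],
                   'D': [1, 3, 5]})

# Column C mixes str and int, so its dtype is object and it is excluded.
numeric_cols = df.select_dtypes(exclude='object')
print(numeric_cols)
```

All three rows remain and column C is gone entirely, so this method alone does not drop the rows with non-string values in C.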

How do I drop rows whose value in a Pandas column is not a string?
