Question

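I'm new to Python and Pandas, and I'm struggling with this problem.

I have a dataset with an Age column of dtype float64. Some values have a fractional part and some do not. I want to remove all rows where Age is a whole number.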

Here is my attempt:

estimatedAges = train[int(train['Age']) < train['Age']]

But I got this error:

TypeError                                 Traceback (most recent call last)
 in ()
      1 #estimatedAges = train[train['Age'] > 1]
----> 2 estimatedAges = train[int(train['Age']) < train['Age']]
      3 estimatedAges.info()

C:\Anaconda3\lib\site-packages\pandas\core\series.py in wrapper(self)
     76             return converter(self.iloc[0])
     77         raise TypeError("cannot convert the series to "
---> 78                         "{0}".format(str(converter)))
     79
     80     return wrapper

TypeError: cannot convert the series to <class 'int'>

So it looks like int() doesn't work on a Series, and I'll have to find another way. I'm just not sure what that other way is.

Answer 1

I think you can use astype to cast to int:

estimatedAges = train[train['Age'].astype(int) < train['Age']]

Sample:

import pandas as pd

train = pd.DataFrame({'Age':[1,2,3.4]})
print (train)
   Age
0  1.0
1  2.0
2  3.4

print (train[train['Age'].astype(int) < train['Age']])
   Age
2  3.4

Timings:

train = pd.DataFrame({'Age':[1,2,3.4]})
train = pd.concat([train]*10000).reset_index(drop=True)

In [62]: %timeit (train[train['Age'].astype(int) < train['Age']])
The slowest run took 6.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 544 µs per loop

In [63]: %timeit (train[train['Age'].apply(int) < train['Age']])
100 loops, best of 3: 11.1 ms per loop

In [64]: %timeit (train[train.Age > train.Age.round(0)])
1000 loops, best of 3: 1.55 ms per loop
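One caveat on the third timing: comparing against round(0) is not equivalent to the astype(int) comparison, because rounding can go up. A minimal sketch of where the two differ, using made-up sample values (the 3.6 row is an assumption added for illustration, not part of the original data):

```python
import pandas as pd

# Hypothetical sample data; 3.6 is added to show the rounding case.
train = pd.DataFrame({'Age': [1.0, 2.0, 3.4, 3.6]})

# Truncation: 3.4 -> 3 and 3.6 -> 3, so both fractional rows are kept.
by_truncation = train[train['Age'].astype(int) < train['Age']]

# Rounding: 3.6 -> 4.0, and 3.6 > 4.0 is False, so that row is dropped.
by_rounding = train[train['Age'] > train['Age'].round(0)]

print(by_truncation['Age'].tolist())  # [3.4, 3.6]
print(by_rounding['Age'].tolist())    # [3.4]
```

So the round(0) variant only matches the others when every fractional part is below one half.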

Edit, based on ajcr's comment, thanks:

If the column contains both negative and positive floats, use:

train = pd.DataFrame({'Age':[1,-2.8,3.9]})
print (train)
   Age
0  1.0
1 -2.8
2  3.9

print (train[train['Age'].astype(int) != train['Age']])
   Age
1 -2.8
2  3.9
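The reason the comparison has to change is that astype(int) truncates toward zero, so a negative fractional value becomes larger, not smaller, after the cast. A small sketch on the same sample values, comparing both operators:

```python
import pandas as pd

train = pd.DataFrame({'Age': [1, -2.8, 3.9]})

# astype(int) truncates toward zero: 1.0 -> 1, -2.8 -> -2, 3.9 -> 3
truncated = train['Age'].astype(int)

# With '<', the negative row is missed because -2 < -2.8 is False.
less_than = train[truncated < train['Age']]

# With '!=', any value that changed under truncation is kept.
not_equal = train[truncated != train['Age']]

print(less_than['Age'].tolist())  # [3.9]
print(not_equal['Age'].tolist())  # [-2.8, 3.9]
```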

Answer 2

Try this:

In [179]: train[train.Age != train.Age // 1]
Out[179]:
   Age
2  3.4
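For what it's worth, floor division by 1 floors each value rather than truncating it, and since the test is an inequality it should also hold up with negative values. A rough sketch with made-up data that includes a negative; the `% 1` variant is an alternative added here for comparison, not something from the answer:

```python
import pandas as pd

# Hypothetical sample data with a negative value.
train = pd.DataFrame({'Age': [1, -2.8, 3.4]})

# Floor division rounds down: 1.0 -> 1.0, -2.8 -> -3.0, 3.4 -> 3.0
floored = train.Age // 1

# Keep rows whose value differs from its floor.
by_floordiv = train[train.Age != floored]

# Alternative: a non-zero remainder modulo 1 means a fractional part.
by_modulo = train[train.Age % 1 != 0]

print(by_floordiv['Age'].tolist())  # [-2.8, 3.4]
print(by_modulo['Age'].tolist())    # [-2.8, 3.4]
```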

Answer 3

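I ended up going with @jezreal's answer because his speed tests were convincing, but I wanted to add one more solution I found. It requires numpy, but if you have already imported pandas, you have most likely imported numpy too.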

import numpy as np
train[np.floor(train['Age']) != train['Age']]
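One thing to watch for if the Age column has missing values: np.floor leaves NaN as NaN, and NaN != NaN evaluates to True, so rows with a missing Age pass this filter. A sketch, assuming a made-up frame with one missing value, of how those rows could be excluded with notna():

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing Age.
train = pd.DataFrame({'Age': [1.0, np.nan, 3.4]})
age = train['Age']

# NaN != NaN is True, so the missing row is kept alongside 3.4.
with_missing = train[np.floor(age) != age]

# Requiring a non-missing value first leaves only the fractional row.
without_missing = train[age.notna() & (np.floor(age) != age)]

print(with_missing.index.tolist())     # [1, 2]
print(without_missing.index.tolist())  # [2]
```

Whether the missing rows should stay or go depends on what the filtered frame is for. Also, as far as I know, the astype(int) approach raises an error on NaN rather than silently converting it.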

How to filter rows in a DataFrame on a column that contains whole-number values
