python pandas: how do I find the rows that are in one DataFrame but not in another?

Time: 2015-09-18 12:17:10

Tags: python pandas dataframe

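Suppose I have two tables, people_all and people_usa, which have the same structure and therefore the same primary key.

How can I get a table of the people who are not in the USA? In SQL I would do something like: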

select a.*
from people_all a
left outer join people_usa u
on a.id = u.id
where u.id is null

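What is the Python equivalent? I cannot think of a way to translate this where clause into pandas syntax.

The only way I can think of is to add an arbitrary field to people_usa (e.g. people_usa['dummy']=1), do a left join, keep only the records where 'dummy' is NaN, and then drop the dummy field, which seems a bit convoluted.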
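The left-join idea can be sketched without a hand-made dummy column. This is a minimal sketch, assuming a pandas version whose merge() accepts indicator=True and assuming the key column is called id; the sample rows and the name column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the 'id' and 'name' columns are assumptions.
people_all = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["Ann", "Bo", "Cy", "Di"]})
people_usa = pd.DataFrame({"id": [2, 4], "name": ["Bo", "Di"]})

# Left join on the key only; indicator=True adds a '_merge' column that
# records whether each row matched ('both') or not ('left_only').
merged = people_all.merge(people_usa[["id"]], on="id", how="left", indicator=True)

# Counterpart of "where u.id is null": keep the rows with no match on the right.
not_usa = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(not_usa)
```

The _merge column plays the role of the dummy field, so nothing has to be added to people_usa beforehand.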

Thanks!

3 answers:

Answer 0 (score: 12)

Use isin and negate the boolean mask:

people_usa[~people_usa['ID'].isin(people_all['ID'])]

Example:

In [364]:
import numpy as np
import pandas as pd

people_all = pd.DataFrame({'ID': np.arange(5)})
people_usa = pd.DataFrame({'ID': [3, 4, 6, 7, 100]})
people_usa[~people_usa['ID'].isin(people_all['ID'])]

Out[364]:
    ID
2    6
3    7
4  100

So 3 and 4 are removed from the result. The boolean mask looks like this:

In [366]:
people_usa['ID'].isin(people_all['ID'])

Out[366]:
0     True
1     True
2    False
3    False
4    False
Name: ID, dtype: bool

Using ~ inverts the mask.
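The example above filters people_usa. For the question as asked (people who are not in the USA), the same pattern would presumably be applied the other way round. A minimal sketch, assuming people_usa is a subset of people_all and using made-up IDs:

```python
import numpy as np
import pandas as pd

people_all = pd.DataFrame({"ID": np.arange(5)})
people_usa = pd.DataFrame({"ID": [3, 4]})  # hypothetical subset of people_all

# True where the person also appears in people_usa.
in_usa = people_all["ID"].isin(people_usa["ID"])

# ~ flips the mask, keeping only the people who are not in the USA.
not_usa = people_all[~in_usa]
print(not_usa)
```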

Answer 1 (score: 2)

Here is another approach that reads much like SQL, the pandas .query() method. NumPy's in1d() function is an alternative way to build the same membership test. With .query():

people_all.query('ID not in @people_usa.ID')
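A side-by-side sketch of the two variants, using made-up IDs. It uses np.isin rather than in1d, on the assumption that a reasonably recent NumPy is installed, where isin is the recommended function for this membership test.

```python
import numpy as np
import pandas as pd

people_all = pd.DataFrame({"ID": np.arange(5)})
people_usa = pd.DataFrame({"ID": [3, 4]})  # hypothetical sample data

# .query(): the @ prefix refers to a variable from the surrounding Python scope.
via_query = people_all.query("ID not in @people_usa.ID")

# NumPy: build a boolean membership mask, then negate it.
mask = np.isin(people_all["ID"].to_numpy(), people_usa["ID"].to_numpy())
via_numpy = people_all[~mask]

print(via_query)
print(via_numpy)
```

Both variants select the same rows here; which one reads better is largely a matter of taste.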

Note: for those with SQL experience, it may be worth reading "Comparison with SQL" in the pandas documentation.

Answer 2 (score: -1)

I would combine the DataFrames (by stacking them) and then apply the .drop_duplicates method. The documentation is here:

http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html
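One way this idea could be made to work is sketched below. It rests on assumptions the answer does not state: people_usa must be a subset of people_all, neither frame may contain duplicate rows of its own, and keep=False must be passed so that every copy of a duplicated row is dropped. The IDs are made up.

```python
import pandas as pd

people_all = pd.DataFrame({"ID": [0, 1, 2, 3, 4]})
people_usa = pd.DataFrame({"ID": [3, 4]})  # hypothetical subset of people_all

# Stack the two frames; rows present in both now appear twice.
stacked = pd.concat([people_all, people_usa], ignore_index=True)

# keep=False drops every copy of a duplicated row, leaving the rows that
# appear in only one of the frames.
not_usa = stacked.drop_duplicates(keep=False)
print(not_usa)
```

If people_usa held rows missing from people_all, they would survive as well, so this yields a symmetric difference rather than a true anti-join.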