I have two DataFrames. df1 looks like this:
id year CalendarWeek DayName interval counts
1 2014 1 sun 10:30 3
1 2014 1 sun 11:30 4
1 2014 2 wed 12:00 5
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 3
2 2014 1 sun 14:30 1
2 2014 1 mon 10:30 2
2 2014 2 wed 14:00 3
2 2014 2 fri 15:00 5
3 2014 1 thu 16:30 2
3 2014 1 thu 17:00 1
3 2014 2 sat 12:00 2
3 2014 2 sat 13:30 3
df2 looks like this:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:00 2
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 10:30 6
1 2014 2 wed 12:00 3
1 2014 2 fri 8:30 1
1 2014 2 fri 9:00 2
2 2014 1 sun 12:30 3
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 9:00 35
2 2014 1 mon 10:30 1
2 2014 2 wed 12:30 23
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 14:30 1
3 2014 1 thu 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
3 2014 2 sat 14:00 2
I want to get all rows in df2 that match df1 on the columns id, year, CalendarWeek, DayName and interval. The result I want should look like this:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 12:00 3
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 10:30 1
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
In Python, how can I select these specific rows of one DataFrame based on the columns of another DataFrame?
Thanks!
Answer 0 (score: 2)
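Perform a merge and pass the list of columns to the on parameter. The default merge type is 'inner', which keeps only the rows whose key values exist in both DataFrames (in the session below, df is the frame holding counts and df1 the frame holding NewCounts, as the output columns show):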
In [2]:
df.merge(df1, on=['id','year','CalendarWeek','DayName','interval'])
Out[2]:
id year CalendarWeek DayName interval counts NewCounts
0 1 2014 1 sun 10:30 3 4
1 1 2014 1 sun 11:30 4 5
2 1 2014 2 wed 12:00 5 3
3 1 2014 2 fri 9:00 2 2
4 2 2014 1 sun 13:00 3 4
5 2 2014 1 sun 14:30 1 4
6 2 2014 1 mon 10:30 2 1
7 2 2014 2 wed 14:00 3 4
8 2 2014 2 fri 15:00 5 3
9 3 2014 1 thu 16:30 2 34
10 3 2014 1 thu 17:00 1 5
11 3 2014 2 sat 12:00 2 3
12 3 2014 2 sat 13:30 3 4
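The merged output above also carries the counts column, while the question asks only for the rows of df2. One possible way to get that, sketched here on a hand-built subset of the question's data, is to merge df2 against just the key columns of df1. The variable names keys and result are illustrative and not part of the original answer.

```python
import pandas as pd

# Columns that must match between the two frames.
keys = ["id", "year", "CalendarWeek", "DayName", "interval"]

# A small hand-built subset of the question's data, for illustration only.
df1 = pd.DataFrame({
    "id": [1, 1, 2],
    "year": [2014, 2014, 2014],
    "CalendarWeek": [1, 1, 1],
    "DayName": ["sun", "sun", "mon"],
    "interval": ["10:30", "11:30", "10:30"],
    "counts": [3, 4, 2],
})
df2 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "year": [2014] * 5,
    "CalendarWeek": [1] * 5,
    "DayName": ["sun", "sun", "sun", "mon", "mon"],
    "interval": ["10:00", "10:30", "11:30", "9:00", "10:30"],
    "NewCounts": [2, 4, 5, 35, 1],
})

# Merge against only df1's key columns so that `counts` never enters the result.
# drop_duplicates guards against repeated key rows in df1 multiplying df2's rows.
result = df2.merge(df1[keys].drop_duplicates(), on=keys, how="inner")
print(result)
```

The result keeps df2's columns only, which matches the shape of the desired output in the question.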
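If your 'id' column is your index, you have to reset the index on both DataFrames so that id becomes a regular column. This is because an inner join produces incorrect results if you pass the on list of columns and also specify left_index=True and right_index=True: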
In [4]:
df.merge(df1, on=['year','CalendarWeek','DayName','interval'], left_index=True, right_index=True)
Out[4]:
year CalendarWeek DayName interval counts NewCounts
id
1 2014 1 sun 10:30 3 2
1 2014 1 sun 10:30 3 4
1 2014 1 sun 10:30 3 5
1 2014 1 sun 10:30 3 6
1 2014 1 sun 10:30 3 3
1 2014 1 sun 10:30 3 1
1 2014 1 sun 10:30 3 2
1 2014 1 sun 11:30 4 2
1 2014 1 sun 11:30 4 4
1 2014 1 sun 11:30 4 5
1 2014 1 sun 11:30 4 6
1 2014 1 sun 11:30 4 3
1 2014 1 sun 11:30 4 1
1 2014 1 sun 11:30 4 2
1 2014 2 wed 12:00 5 2
1 2014 2 wed 12:00 5 4
1 2014 2 wed 12:00 5 5
1 2014 2 wed 12:00 5 6
1 2014 2 wed 12:00 5 3
1 2014 2 wed 12:00 5 1
1 2014 2 wed 12:00 5 2
1 2014 2 fri 9:00 2 2
1 2014 2 fri 9:00 2 4
1 2014 2 fri 9:00 2 5
1 2014 2 fri 9:00 2 6
1 2014 2 fri 9:00 2 3
1 2014 2 fri 9:00 2 1
1 2014 2 fri 9:00 2 2
2 2014 1 sun 13:00 3 3
2 2014 1 sun 13:00 3 4
.. ... ... ... ... ... ...
2 2014 2 fri 15:00 5 4
2 2014 2 fri 15:00 5 3
3 2014 1 thu 16:30 2 1
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 34
3 2014 1 thu 16:30 2 5
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 4
3 2014 1 thu 16:30 2 2
3 2014 1 thu 17:00 1 1
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 34
3 2014 1 thu 17:00 1 5
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 4
3 2014 1 thu 17:00 1 2
3 2014 2 sat 12:00 2 1
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 34
3 2014 2 sat 12:00 2 5
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 4
3 2014 2 sat 12:00 2 2
3 2014 2 sat 13:30 3 1
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 34
3 2014 2 sat 13:30 3 5
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 4
3 2014 2 sat 13:30 3 2
[96 rows x 6 columns]
So to reset the index, just run df = df.reset_index(0), and likewise for the other DataFrame. After merging you can set the index back to id, like so:
merged = df.merge(df1, on=['id','year','CalendarWeek','DayName','interval'])
merged = merged.set_index('id')
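The reset, merge and restore sequence can be sketched end to end as follows. This is a minimal illustration on a hand-built subset of the data, assuming both frames start out indexed by id. The names left and right are placeholders for the two frames and do not come from the original answer.

```python
import pandas as pd

keys = ["id", "year", "CalendarWeek", "DayName", "interval"]

# Both frames start out indexed by id (an assumption made for this sketch).
left = pd.DataFrame({
    "id": [1, 1, 2],
    "year": [2014] * 3,
    "CalendarWeek": [1] * 3,
    "DayName": ["sun", "sun", "mon"],
    "interval": ["10:30", "11:30", "10:30"],
    "counts": [3, 4, 2],
}).set_index("id")
right = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "year": [2014] * 5,
    "CalendarWeek": [1] * 5,
    "DayName": ["sun", "sun", "sun", "mon", "mon"],
    "interval": ["10:00", "10:30", "11:30", "9:00", "10:30"],
    "NewCounts": [2, 4, 5, 35, 1],
}).set_index("id")

# Turn the id index back into an ordinary column on both sides,
# merge on all five key columns, then restore id as the index.
merged = left.reset_index().merge(right.reset_index(), on=keys)
merged = merged.set_index("id")
print(merged)
```

Because id takes part in the join as a regular column, each row on the left pairs only with the row on the right that has the same id, year, week, day and interval.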