Question

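I'm looking for a pandas alternative to Excel's VLOOKUP when two DataFrames and a one-to-many relationship are involved. I've searched for an answer and don't believe I've found one that addresses my use case. I found some related posts, but they are not quite what I need.

Situation:

I have two DataFrames and a key (site) that links them. I would use the pandas merge function, but I don't want multiple records returned for one key (in this example, site B).

Specifically, I want to return the site's status from the loc_status DataFrame, if it exists. If a site has both "INACTIVE" and "ACTIVE" statuses, I want to return only "ACTIVE".

Here is a basic example: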

import pandas as pd

loc_status = [{'site':'A', 'status':'ACTIVE'}, {'site':'B', 'status':'ACTIVE'}, {'site':'B', 'status':'INACTIVE'}, {'site':'C', 'status':'INACTIVE'}]

loc = [{'site':'A'}, {'site':'B'},{'site':'C'}, {'site':'D'} ]

df_status = pd.DataFrame(loc_status)

+----+-------+----------+
|    | site  |  status  |
+----+-------+----------+
| 0  | A     | ACTIVE   |
| 1  | B     | ACTIVE   |
| 2  | B     | INACTIVE |
| 3  | C     | INACTIVE |
+----+-------+----------+

df_loc = pd.DataFrame(loc)

+----+-------+
|    | site  |
+----+-------+
| 0  | A     |
| 1  | B     |
| 2  | C     |
| 3  | D     |
+----+-------+

result = [{'site':'A', 'status':'ACTIVE'}, {'site':'B', 'status':'ACTIVE'},{'site':'C', 'status':'INACTIVE'}, {'site': 'D'}]

df_result = pd.DataFrame(result)

+----+-------+----------+
|    | site  |  status  |
+----+-------+----------+
| 0  | A     | ACTIVE   |
| 1  | B     | ACTIVE   |
| 2  | C     | INACTIVE |
| 3  | D     | NaN      |
+----+-------+----------+
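
For context, here is a minimal sketch of the behaviour I want to avoid, rebuilt from the sample data above: a plain left merge keeps every row of df_loc, but site B matches two status rows and therefore shows up twice.

```python
import pandas as pd

df_status = pd.DataFrame({
    'site': ['A', 'B', 'B', 'C'],
    'status': ['ACTIVE', 'ACTIVE', 'INACTIVE', 'INACTIVE'],
})
df_loc = pd.DataFrame({'site': ['A', 'B', 'C', 'D']})

# Left merge: every df_loc row is kept, but each matching status row
# produces its own output row, so site B is duplicated.
merged = df_loc.merge(df_status, on='site', how='left')
print(merged)
print(len(df_loc), len(merged))
```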

Thanks.

Answer 1

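First, sort_values by both columns so that, for any site with duplicates, the ACTIVE row comes first (this works because 'ACTIVE' sorts before 'INACTIVE' alphabetically). Then drop_duplicates, whose default is keep='first', keeps only the ACTIVE row for duplicated sites.

Finally, map using a Series created by set_index: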

df_status = df_status.sort_values(['site','status']).drop_duplicates(['site'])
print (df_status)
  site    status
0    A    ACTIVE
1    B    ACTIVE
3    C  INACTIVE

df_loc['status'] = df_loc['site'].map(df_status.set_index('site')['status'])
print (df_loc)
  site    status
0    A    ACTIVE
1    B    ACTIVE
2    C  INACTIVE
3    D       NaN
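
The sort above depends on 'ACTIVE' happening to sort before 'INACTIVE'. If the status labels did not sort in the preferred order, one possible variation is to sort on an explicit priority instead. The sketch below assumes a hypothetical priority dict and a temporary helper column named prio; neither is part of the original answer.

```python
import pandas as pd

df_status = pd.DataFrame({
    'site': ['A', 'B', 'B', 'C'],
    'status': ['ACTIVE', 'ACTIVE', 'INACTIVE', 'INACTIVE'],
})
df_loc = pd.DataFrame({'site': ['A', 'B', 'C', 'D']})

# Hypothetical priority table (an assumption for illustration): lower wins.
priority = {'ACTIVE': 0, 'INACTIVE': 1}

lookup = (df_status
          .assign(prio=df_status['status'].map(priority))  # helper column
          .sort_values(['site', 'prio'])                   # best status first per site
          .drop_duplicates('site')                         # keep='first' by default
          .set_index('site')['status'])                    # Series: site -> status

# Sites missing from the lookup (here D) come back as NaN.
df_loc['status'] = df_loc['site'].map(lookup)
print(df_loc)
```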

Timings:

#jezrael solution
In [136]: %timeit df_loc['status_jez'] = df_loc['site'].map(df_status.sort_values(['site','status']).drop_duplicates(['site']).set_index('site')['status'])
10 loops, best of 3: 67.3 ms per loop

#Allen solution
In [137]: %timeit pd.merge(df_loc,df_status.sort_values(['site','status']).groupby(by='site').first().reset_index(),how='left')
10 loops, best of 3: 114 ms per loop

#piRSquared solution
In [138]: %timeit df_loc.assign(status_pir=df_loc.site.map(df_status.loc[(df_status.status == 'ACTIVE').groupby(df_status.site).idxmax()].set_index('site').status))
1 loop, best of 3: 3.37 s per loop

Code for timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
L = np.random.randint(100000,size=N)
df_status = pd.DataFrame({'site': np.random.choice(L, N),
                         'status':np.random.choice(['ACTIVE','INACTIVE'],N)})
print (df_status.head(10))

df_loc = pd.DataFrame({'site':L})
print (df_loc.head(10))
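
%timeit is only available in IPython. Outside of it, a rough equivalent can be sketched with the standard timeit module. The function name map_lookup and the much smaller N are assumptions made here to keep the run short; the numbers it prints are not comparable to the timings above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000  # assumption: far smaller than the N used above
L = np.random.randint(N, size=N)
df_status = pd.DataFrame({'site': np.random.choice(L, N),
                          'status': np.random.choice(['ACTIVE', 'INACTIVE'], N)})
df_loc = pd.DataFrame({'site': L})

def map_lookup():
    # sort + drop_duplicates + map, as in the first timed statement
    s = (df_status.sort_values(['site', 'status'])
                  .drop_duplicates(['site'])
                  .set_index('site')['status'])
    return df_loc['site'].map(s)

# Best of 3 repeats, 10 calls per repeat, reported per call.
best = min(timeit.repeat(map_lookup, number=10, repeat=3)) / 10
print('%.2f ms per loop' % (best * 1000))
```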

Answer 2

#remove Inactive rows if there's an active row for a certain site.
df_status = df_status.sort_values(['site','status']).groupby(by='site').first().reset_index()
#join loc and status df.
pd.merge(df_loc,df_status,how='left')

Out[108]: 
  site    status
0    A    ACTIVE
1    B    ACTIVE
2    C  INACTIVE
3    D       NaN
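
As a self-contained sketch of the same idea, the merge can also be asked to check that the right-hand side really has one row per site. This assumes a pandas version whose merge accepts the validate argument; the name one_per_site is chosen here for illustration.

```python
import pandas as pd

df_status = pd.DataFrame({
    'site': ['A', 'B', 'B', 'C'],
    'status': ['ACTIVE', 'ACTIVE', 'INACTIVE', 'INACTIVE'],
})
df_loc = pd.DataFrame({'site': ['A', 'B', 'C', 'D']})

# One row per site: after sorting, the first status in each group is ACTIVE
# whenever the site has an ACTIVE row.
one_per_site = (df_status.sort_values(['site', 'status'])
                         .groupby('site', as_index=False)
                         .first())

# validate='many_to_one' raises a MergeError if the right side still
# contains duplicate keys, so the row count of df_loc cannot grow.
result = pd.merge(df_loc, one_per_site, on='site', how='left',
                  validate='many_to_one')
print(result)
```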

Answer 3

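Define status, a boolean Series that says whether status == 'ACTIVE'.
Define site, a Series, for convenience.
By grouping status by site and taking idxmax, I find the index of the first ACTIVE row for each site, if it has one.
Using df_status.loc, I can slice df_status down to unique values of site, with priority given to ACTIVE.
By setting the index and taking the status column, I create a pd.Series that works like a dict for map.
Use assign + map to create the new column.

status = df_status.status == 'ACTIVE'
site = df_status.site
idx = status.groupby(site).idxmax()
m = df_status.loc[idx].set_index('site').status
df_loc.assign(status=df_loc.site.map(m))

  site    status
0    A    ACTIVE
1    B    ACTIVE
2    C  INACTIVE
3    D       NaN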
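
To see what idxmax selects for each site, here is a small sketch that prints the intermediate values. Two things are assumptions of this sketch rather than part of the answer: the sample rows for site B are reordered so that INACTIVE comes first, and the ACTIVE flag is cast to int so the "max" is plainly 1.

```python
import pandas as pd

# Site B deliberately lists INACTIVE before ACTIVE.
df_status = pd.DataFrame({
    'site': ['A', 'B', 'B', 'C'],
    'status': ['ACTIVE', 'INACTIVE', 'ACTIVE', 'INACTIVE'],
})

# 1 where the row is ACTIVE, 0 otherwise.
is_active = (df_status['status'] == 'ACTIVE').astype(int)

# For each site, idxmax returns the row label of the first maximum:
# the first ACTIVE row if there is one, else the site's first row.
idx = is_active.groupby(df_status['site']).idxmax()
print(idx)

# One row per site, with ACTIVE preferred regardless of row order.
print(df_status.loc[idx])
```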


Pandas equivalent of VLOOKUP with a one-to-many relationship, returning one result
