Question

I have created two DataFrames in Python using pandas:

df1

id    business   state   inBusiness
1     painter     AL        no
2     insurance   AL        no
3     lawyer      OH        no
4     dentist     NY        yes
...........

df2  

id    business    state
1     painter       NY
2     painter       AL
3     builder       TX
4     painter       AL    
......

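Basically, I want to set the 'inBusiness' value in df1 to 'yes' whenever df2 contains an instance of exactly the same business/state combination.

So, for example, if painter / AL exists in df2, the 'inBusiness' value is set to yes for every painter / AL row in df1.

The best I can come up with right now is: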

for index, row in df2.iterrows():
    df1[ (df1.business==str(row['business'])) & (df1.state==str(row['state']))]['inBusiness'] = 'Yes'

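But the first DataFrame may have hundreds of thousands of rows to scan for every row of the second DataFrame, so this approach isn't really practical. Is there a nice one-liner I could use here that is also fast?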

Answer 1

You can use .merge(how='left', indicator=True) (indicator was added in pandas>=0.17, see the docs) to identify which rows match and where each match comes from, giving something like this:

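# Merge on business/state only; de-duplicating df2's pairs keeps df1's row count unchanged
df = df1.merge(df2[['business', 'state']].drop_duplicates(), on=['business', 'state'], how='left', indicator=True)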

   id   business state inBusiness     _merge
0   1    painter    AL         no       both
1   2  insurance    AL         no  left_only
2   3     lawyer    OH         no  left_only
3   4    dentist    NY        yes  left_only

_merge shows in which cases the (business, state) combination is present in both df1 and df2. Then you just need:

df['inBusiness'] = df._merge == 'both'

to get:

   id   business state inBusiness     _merge
0   1    painter    AL       True       both
1   2  insurance    AL      False  left_only
2   3     lawyer    OH      False  left_only
3   4    dentist    NY      False  left_only
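
Since the question asks for 'yes'/'no' strings rather than booleans, the flag can be mapped back afterwards. The following is a minimal sketch, assuming the sample frames from the question and assuming that rows already marked 'yes' in df1 should keep that value (the question does not say either way); the names keys, merged and result are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "insurance", "lawyer", "dentist"],
    "state": ["AL", "AL", "OH", "NY"],
    "inBusiness": ["no", "no", "no", "yes"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "painter", "builder", "painter"],
    "state": ["NY", "AL", "TX", "AL"],
})

# Keep only the key columns and drop duplicate pairs so the left merge
# cannot multiply rows of df1.
keys = df2[["business", "state"]].drop_duplicates()
merged = df1.merge(keys, on=["business", "state"], how="left", indicator=True)

# Rows found in both frames become 'yes'; every other row keeps its old value
# (an assumption made for this sketch).
merged["inBusiness"] = np.where(
    merged["_merge"] == "both", "yes", merged["inBusiness"]
)
result = merged.drop(columns="_merge")
print(result)
```

If unmatched rows should instead be reset to 'no', replace the last argument of np.where with the string "no".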

Answer 2

Building a lookup dictionary (a mapping) is probably the most efficient approach:

inBusiness = {(business,state): 'yes' 
              for business,state in zip(df2['business'],df2['state'])}
df1['inBusiness'] = [ inBusiness.get((business,state),"no") 
                     for business,state in zip(df1['business'],df1['state'])]
df1

OUTPUTS

    id  business    state   inBusiness
0   1   painter     AL  yes
1   2   insurance   AL  no
2   3   lawyer      OH  no
3   4   dentist     NY  no

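Edit, with explanation:

You were vague about what "explain further" means, so I'll give a high-level overview of everything.

The built-in zip function takes two iterables (such as two lists or two Series) and "zips" them together into tuples.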

a = [1,2,3]
b = ['a','b','c']
for tup in zip(a,b): print(tup)

Output:

(1, 'a')
(2, 'b')
(3, 'c')

Also, tuples in Python can be "unpacked" into individual variables:

tup = (3,4)
x,y = tup
print(x)
print(y)

You can combine these two things to write dictionary comprehensions:

newDict = {k: v for k,v in zip(a,b)}
newDict

Output:

{1: 'a', 2: 'b', 3: 'c'}

inBusiness is a Python dictionary built with a dictionary comprehension after zipping the df2['business'] and df2['state'] Series together.

I didn't actually need to unpack the variables, but I did so because I think it reads more clearly.
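
To illustrate that point, here is a small sketch on the toy lists from above showing that using each zipped tuple directly as the key builds the same dictionary as unpacking it first. The names unpacked and direct are made up for this example.

```python
a = [1, 2, 3]
b = ['a', 'b', 'c']

# Unpack each tuple into two names, then rebuild the tuple as the key.
unpacked = {(x, y): 'yes' for x, y in zip(a, b)}

# Use each tuple produced by zip directly as the key.
direct = {pair: 'yes' for pair in zip(a, b)}

print(unpacked == direct)
```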

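Note that this dictionary is only half of the mapping you need, because every (business, state) key in it maps to 'yes'. Thankfully, dict.get lets us specify a default value to return when the key is not found - in your case, "no".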
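
A short sketch of that default-value behaviour, using a hand-written dictionary with two made-up keys rather than the one built from df2:

```python
# Every key present in the dictionary maps to 'yes'.
in_business = {('painter', 'AL'): 'yes', ('builder', 'TX'): 'yes'}

# A key that exists returns its stored value.
hit = in_business.get(('painter', 'AL'), 'no')

# A missing key falls back to the default passed as the second argument.
miss = in_business.get(('lawyer', 'OH'), 'no')

print(hit, miss)
```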

Then a list comprehension builds the new column, which gives the desired result.
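
Because every value in the dictionary is 'yes', a plain set of (business, state) pairs would work just as well. The sketch below is a variation on the approach above, not the code from the answer itself; it assumes the sample frames from the question and, like the answer's code, overwrites any existing 'inBusiness' values.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "insurance", "lawyer", "dentist"],
    "state": ["AL", "AL", "OH", "NY"],
    "inBusiness": ["no", "no", "no", "yes"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "painter", "builder", "painter"],
    "state": ["NY", "AL", "TX", "AL"],
})

# All (business, state) pairs that occur in df2; duplicates collapse in a set.
pairs = set(zip(df2["business"], df2["state"]))

# Membership tests on a set take constant time on average, so this is a
# single pass over df1 instead of one scan of df1 per row of df2.
df1["inBusiness"] = [
    "yes" if key in pairs else "no"
    for key in zip(df1["business"], df1["state"])
]
print(df1)
```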

Does that cover everything?

Change values in a DataFrame column based on the values of two columns in a different DataFrame
