Change values in a DataFrame column based on the values of two columns in a different DataFrame

Time: 2016-06-17 14:50:44

Tags: python pandas dataframe

I have created two DataFrames with pandas in Python:

df1

id    business   state   inBusiness
1     painter     AL        no
2     insurance   AL        no
3     lawyer      OH        no
4     dentist     NY        yes
...........

df2  

id    business    state
1     painter       NY
2     painter       AL
3     builder       TX
4     painter       AL    
......

Basically, I want to set the 'inBusiness' value in df1 to 'yes' whenever df2 contains an instance of exactly the same business/state combination.

So, for example, if painter / AL exists in df2, then the 'inBusiness' value is set to yes for every painter / AL row in df1.

The best I can come up with right now is:

for index, row in df2.iterrows():
    # use .loc so the assignment reaches df1 itself rather than a temporary copy
    df1.loc[(df1.business == str(row['business'])) & (df1.state == str(row['state'])), 'inBusiness'] = 'yes'

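But the first DataFrame may have hundreds of thousands of rows to scan for every row of the second DataFrame, so this approach does not scale well. Is there a nice one-liner I could use here that is also fast?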

2 Answers:

Answer 0: (score: 2)

You can use .merge(how='left', indicator=True) (indicator was added in pandas>=0.17, see docs) to identify the matching rows as well as the source of each match, to get something like this:

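# merge on the shared (business, state) columns only; drop id and duplicate pairs from df2 first
df = df1.merge(df2[['business', 'state']].drop_duplicates(), how='left', indicator=True)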

   id   business state inBusiness     _merge
0   1    painter    AL         no       both
1   2  insurance    AL         no  left_only
2   3     lawyer    OH         no  left_only
3   4    dentist    NY        yes  left_only

_merge indicates in which of df1 and df2 each (business, state) combination is available. Then you only need:

df['inBusiness'] = df._merge == 'both'

to get:

   id   business state inBusiness     _merge
0   1    painter    AL       True       both
1   2  insurance    AL      False  left_only
2   3     lawyer    OH      False  left_only
3   4    dentist    NY      False  left_only
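
If the column should keep the 'yes'/'no' strings from the question rather than booleans, the indicator can be turned into a mask instead. What follows is only a sketch under a few assumptions: the sample frames are rebuilt from the question, df2 is reduced to its unique (business, state) pairs so that the left merge returns exactly one row per df1 row, and rows that were already 'yes' are left alone (which differs from the overwrite shown above). The names keys, merged, matched and result are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "insurance", "lawyer", "dentist"],
    "state": ["AL", "AL", "OH", "NY"],
    "inBusiness": ["no", "no", "no", "yes"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "painter", "builder", "painter"],
    "state": ["NY", "AL", "TX", "AL"],
})

# Unique (business, state) pairs from df2; leaving out id keeps it from being a join key.
keys = df2[["business", "state"]].drop_duplicates()

# A left merge keeps the row order of df1; _merge is "both" where a pair was found.
merged = df1.merge(keys, on=["business", "state"], how="left", indicator=True)
matched = (merged["_merge"] == "both").to_numpy()

# Only flip matched rows to "yes"; everything else keeps its original value.
result = df1.copy()
result.loc[matched, "inBusiness"] = "yes"
print(result)
```

Using the positional mask relies on the merge returning one row per df1 row, which is why the duplicate painter / AL pair in df2 is dropped first.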

Answer 1: (score: 1)

Creating a map is probably the most efficient approach:

inBusiness = {(business,state): 'yes' 
              for business,state in zip(df2['business'],df2['state'])}
df1['inBusiness'] = [ inBusiness.get((business,state),"no") 
                     for business,state in zip(df1['business'],df1['state'])]
df1

Output:

    id  business    state   inBusiness
0   1   painter     AL  yes
1   2   insurance   AL  no
2   3   lawyer      OH  no
3   4   dentist     NY  no

Edit to explain:

You were vague about "further explanation", so I'll give a high-level overview of everything.

The built-in zip function takes two iterables (such as two lists, or two Series) and "zips" them together into tuples.

a = [1,2,3]
b = ['a','b','c']
for tup in zip(a,b): print(tup)

Output:

(1, 'a')
(2, 'b')
(3, 'c')

Also, tuples in Python can be "unpacked" into individual variables:

tup = (3,4)
x,y = tup
print(x)
print(y)

You can combine these two things to create dictionary comprehensions:

newDict = {k: v for k,v in zip(a,b)}
newDict

Output:

{1: 'a', 2: 'b', 3: 'c'}

inBusiness is a Python dictionary created with a dictionary comprehension after zipping the df2['business'] and df2['state'] Series together.

I didn't actually need to unpack the variables, but I did so because I think it is clearer.

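Note that this map is only half of the map you are after, because every (business, state) key in the dictionary maps to yes. Thankfully, dict.get lets us specify a default value to return when the key is not found, which in your case is "no".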
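
A tiny illustration of that default, using a hand-written dictionary with a single key (the sample key is taken from the question's data; the variable names are only for illustration):

```python
# A one-entry version of the map: only combinations found in df2 are stored.
in_business = {("painter", "AL"): "yes"}

# Key present: the stored value is returned.
found = in_business.get(("painter", "AL"), "no")

# Key absent: the second argument is returned instead of raising KeyError.
missing = in_business.get(("lawyer", "OH"), "no")

print(found, missing)
```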

Then a list comprehension builds the new column to get the desired result.
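
Since every value in the dictionary is 'yes', a set of keys carries the same information. Below is a sketch of that variant, offered as an alternative rather than as the method above; the frames are rebuilt from the question's sample data and the name seen_pairs is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "insurance", "lawyer", "dentist"],
    "state": ["AL", "AL", "OH", "NY"],
    "inBusiness": ["no", "no", "no", "yes"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "painter", "builder", "painter"],
    "state": ["NY", "AL", "TX", "AL"],
})

# Every (business, state) pair that occurs in df2; duplicates collapse automatically.
seen_pairs = set(zip(df2["business"], df2["state"]))

# One set lookup per df1 row, so the work grows with len(df1) + len(df2).
df1["inBusiness"] = [
    "yes" if pair in seen_pairs else "no"
    for pair in zip(df1["business"], df1["state"])
]
print(df1)
```

Like the dictionary version, this overwrites the whole column, so the dentist / NY row ends up as 'no'.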

Does that cover everything?