I have created two DataFrames with pandas in Python:
df1
id business state inBusiness
1 painter AL no
2 insurance AL no
3 lawyer OH no
4 dentist NY yes
...........
df2
id business state
1 painter NY
2 painter AL
3 builder TX
4 painter AL
......
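Basically, I want to set the 'inBusiness' value in df1 to 'yes' wherever df2 contains an instance of exactly the same business/state combination.
So, for example, if painter / AL exists in df2, then every painter / AL row in df1 gets its 'inBusiness' value set to yes.
The best I can come up with right now is: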
for index, row in df2.iterrows():
    df1[ (df1.business==str(row['business'])) & (df1.state==str(row['state']))]['inBusiness'] = 'Yes'
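But the first DataFrame may have hundreds of thousands of rows to scan for every single row of the second DataFrame, so this approach isn't really viable. Is there a nice one-liner I could use here that is also fast?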
Answer 0 (score: 2)
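You can use .merge(how='left', indicator=True) (indicator was added in pandas>=0.17, see docs) to identify the matching rows as well as the source of each match, which gives something along these lines: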
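# join on business/state only (id is shared too); de-duplicate df2 so df1 rows are not repeated
df = df1.merge(df2[['business', 'state']].drop_duplicates(), how='left', indicator=True)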
id business state inBusiness _merge
0 1 painter AL no both
1 2 insurance AL no left_only
2 3 lawyer OH no left_only
3 4 dentist NY yes left_only
_merge indicates in which cases the (business, state) combination is present in both df1 and df2. Then you only need:
df['inBusiness'] = df._merge == 'both'
to get:
id business state inBusiness _merge
0 1 painter AL True both
1 2 insurance AL False left_only
2 3 lawyer OH False left_only
3 4 dentist NY False left_only
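The frame above ends up with True/False, whereas the question asks for 'yes'/'no' strings. Here is a minimal end-to-end sketch of the same merge idea, under a few assumptions of mine: the sample frames are rebuilt by hand from the question, the join is restricted to business and state (both frames also carry an id column, which a default merge would join on as well), and df2 is de-duplicated first so that repeated pairs do not multiply rows of df1. The names keys and merged are made up for this illustration.

```python
import pandas as pd

# Sample data copied from the question (only the rows shown there).
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "insurance", "lawyer", "dentist"],
    "state": ["AL", "AL", "OH", "NY"],
    "inBusiness": ["no", "no", "no", "yes"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "business": ["painter", "painter", "builder", "painter"],
    "state": ["NY", "AL", "TX", "AL"],
})

# Unique (business, state) pairs from df2; dropping id keeps it out of the join.
keys = df2[["business", "state"]].drop_duplicates()

# A left join keeps every df1 row, in df1's order; _merge records where each row matched.
merged = df1.merge(keys, on=["business", "state"], how="left", indicator=True)

# Turn the indicator into the 'yes'/'no' strings the question asks for.
df1["inBusiness"] = (merged["_merge"] == "both").map({True: "yes", False: "no"}).to_numpy()
print(df1)
```

Because keys holds each pair only once, merged has exactly as many rows as df1, which is what makes the positional assignment back into df1 safe.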
Answer 1 (score: 1)
Building a lookup map (a dictionary) is probably the most efficient approach:
inBusiness = {(business, state): 'yes'
              for business, state in zip(df2['business'], df2['state'])}
df1['inBusiness'] = [inBusiness.get((business, state), "no")
                     for business, state in zip(df1['business'], df1['state'])]
df1
OUTPUTS
id business state inBusiness
0 1 painter AL yes
1 2 insurance AL no
2 3 lawyer OH no
3 4 dentist NY no
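Edit, with an explanation:
You were vague about what "further explanation" you wanted, so I'll give a high-level overview of everything.
The built-in zip function takes two iterables (such as two lists, or two Series) and "zips" them together into tuples.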
a = [1,2,3]
b = ['a','b','c']
for tup in zip(a,b): print(tup)
Output:
(1, 'a')
(2, 'b')
(3, 'c')
Also, tuples in Python can be "unpacked" into individual variables:
tup = (3,4)
x,y = tup
print(x)
print(y)
You can combine these two things to create dictionary comprehensions:
newDict = {k: v for k,v in zip(a,b)}
newDict
Output:
{1: 'a', 2: 'b', 3: 'c'}
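inBusiness is a Python dictionary built with a dictionary comprehension after zipping the df2['business'] and df2['state'] Series together.
I didn't actually need to unpack the variables, but I did so because I think it reads more clearly.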
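For comparison, here is a small sketch of the same dictionary built without unpacking, where each tuple produced by zip is used directly as the key. Plain lists stand in for the two df2 columns, and the names businesses, states and in_business are made up for the illustration.

```python
# Plain lists standing in for df2['business'] and df2['state'] from the question.
businesses = ["painter", "painter", "builder", "painter"]
states = ["NY", "AL", "TX", "AL"]

# Each item from zip is already a (business, state) tuple, so it can be the key as-is.
in_business = {pair: "yes" for pair in zip(businesses, states)}
print(in_business)
```

The repeated painter / AL pair collapses into a single key, so duplicates in df2 cost nothing here.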
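Note that this map is only half of the mapping you are after, because every (business, state) key in the dictionary maps to yes. Thankfully, dict.get lets us specify a default value to return when the key isn't found, which in your case is "no".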
The column you want is then built with a list comprehension, which gives the desired result.
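To make the lookup step concrete, here is a minimal sketch that uses plain tuples in place of the df1 columns. The dictionary is written out by hand from the df2 sample in the question, and rows and flags are illustrative names only.

```python
# Lookup table as it would look after processing the df2 sample rows.
in_business = {
    ("painter", "NY"): "yes",
    ("painter", "AL"): "yes",
    ("builder", "TX"): "yes",
}

# (business, state) pairs standing in for the rows of df1.
rows = [("painter", "AL"), ("insurance", "AL"), ("lawyer", "OH"), ("dentist", "NY")]

# dict.get returns the stored value when the key exists, otherwise the default "no".
flags = [in_business.get(pair, "no") for pair in rows]
print(flags)
```

Each lookup is a hash-table access, so the total work grows with the number of rows in df1 plus the number of rows in df2, rather than with their product as in the loop from the question.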
Does that cover everything?