I have two data frames:
import pandas as pd
d1 = {'Domain': ['amazon.com', 'apple.com', 'amazon.com','xyz.com'], 'Pattern': ['kindle','music','subscribe-and-save',''],'Other Important Info':['a','b','c','d']}
df1 = pd.DataFrame(d1)
d2 = {'Domain': ['google.com','google.com','amazon.com','amazon.com', 'youtube.com', 'amazon.com'], 'Url': ['https://google.com/kindle','https://google.com/','https://amazon.com/subscribe-and-save','https://amazon.com/abc','https://youtube.com/music','https:amazon.com/kindle']}
df2 = pd.DataFrame(d2)
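The main goal is to merge the two data frames on 'Domain', keeping only the rows where 'Pattern' occurs as a substring of 'Url'.
So the result should be the following data frame: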
{'Domain':['amazon.com','amazon.com'],'Url':['https://amazon.com/subscribe-and-save','https:amazon.com/kindle'],'Other Important Info':['c','a']}
My current approach:
def lookup_table(value, df):
    # Return the first pattern that occurs in value, or None if none match
    out = None
    list_items = df['Pattern'].tolist()
    for item in list_items:
        if item in value:
            out = item
            break
    return out
# The column is named 'Url' (capital U), not 'url'
df2['Pattern'] = df2['Url'].apply(lambda x: lookup_table(x, df1[df1['Pattern']!='']))
merged = pd.merge(df2[df2['Pattern'].notnull()], df1[df1['Pattern']!=''],on=['Domain','Pattern'],how='left')
But because of the for loop, the lookup_table function takes too long.
How can I do this faster? I am using Python 2 on Windows.
Answer 0 (score: 4)
df1
       Domain             Pattern  Other Important Info
0  amazon.com              kindle                     a
1   apple.com               music                     b
2  amazon.com  subscribe-and-save                     c
3     xyz.com
df2
        Domain                                    Url
0   google.com              https://google.com/kindle
1   google.com                    https://google.com/
2   amazon.com  https://amazon.com/subscribe-and-save
3   amazon.com                 https://amazon.com/abc
4  youtube.com              https://youtube.com/music
5   amazon.com                https:amazon.com/kindle
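The main goal is to merge the two data frames on 'Domain' and keep only the rows where 'Pattern' occurs in 'Url'. Merge on 'Domain' first, then filter the merged rows with a substring test: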
df = df1.merge(df2, on='Domain')
df.loc[df.apply(lambda x: x.Pattern in x.Url, axis=1)]
Output
       Domain             Pattern  Other Important Info                                    Url
2  amazon.com              kindle                     a                https:amazon.com/kindle
3  amazon.com  subscribe-and-save                     c  https://amazon.com/subscribe-and-save
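Two details of the merge-then-filter approach may be worth spelling out. First, in Python the empty string is a substring of every string, so an empty 'Pattern' (like the xyz.com row) would match any URL on its domain; it does no harm here only because df2 has no xyz.com rows. Second, apply(..., axis=1) calls a Python function once per row, and a list comprehension over the two columns is often cheaper. The following is a minimal sketch under those assumptions, not a benchmarked solution; the names patterns, candidates, mask and result are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Domain': ['amazon.com', 'apple.com', 'amazon.com', 'xyz.com'],
    'Pattern': ['kindle', 'music', 'subscribe-and-save', ''],
    'Other Important Info': ['a', 'b', 'c', 'd'],
})
df2 = pd.DataFrame({
    'Domain': ['google.com', 'google.com', 'amazon.com',
               'amazon.com', 'youtube.com', 'amazon.com'],
    'Url': ['https://google.com/kindle', 'https://google.com/',
            'https://amazon.com/subscribe-and-save', 'https://amazon.com/abc',
            'https://youtube.com/music', 'https:amazon.com/kindle'],
})

# Drop empty patterns up front: '' in some_string is always True.
patterns = df1[df1['Pattern'] != '']

# The inner merge pairs each URL only with the patterns of its own domain.
candidates = patterns.merge(df2, on='Domain')

# Substring test per row, done with zip instead of apply(axis=1).
mask = [p in u for p, u in zip(candidates['Pattern'], candidates['Url'])]

# Keep the columns requested in the question.
result = candidates.loc[mask, ['Domain', 'Url', 'Other Important Info']]
result = result.reset_index(drop=True)
print(result)
```

Note that the merge creates one row for every (pattern, URL) pair within a domain, so memory use grows if a single domain has many patterns and many URLs; in that case filtering in chunks per domain may be a better fit.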