I have two large DataFrames, df1 (100K rows) and df2 (600K rows), which look like this:
# df1
   name                                  price   brand  model
0  CANON CAMERA 20 FS36dINFS MEGAPIXEL   9900.0  CANON  FS36dINFS
1  SONY HD CAMERA 25 MEGAPIXEL           8900.0  SONY
2  LG 55" 4K UHD LED Smart TV 55UJ635V   5890.0  LG     55UJ635V
3  Sony 65" LED Smart TV KD-65XD8505BAE  4790.0  SONY   KD-65XD8505BAE
4  LG 49" 4K UHD LED Smart TV 49UJ651V   4390.0  LG     49UJ651V
# df2
   name                                    store   price
0  LG 49" 4K UHD LED Smart TV 49UJ651V     storeA  4790.0
1  SONY HD CAMERA 25 MEGAPIXEL             storeA  12.90
2  Samsung 32" LED Smart TV UE-32J4505XXE  storeB  1.30
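I want to check whether the brand and model from a row of df1 both appear in a name in df2, and if they do, I do something with the matched pair. Currently I'm using a naive approach that iterates over both DataFrames, like this: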
datalist = []
for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if row1['brand'] in row2['name'] and row1['model'] in row2['name']:
            datalist.append([row1['model'], row1['brand'], row1['name'], row1['price'],
                             row2['name'], row2['price'], row2['store']])
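But this takes a very long time because both DataFrames are large. I've read that sets are faster, but with the way I'm iterating over the DataFrames using iterrows, I can't convert to a set because I would lose the row positions. Is there a faster way to do this?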
Answer 0 (score: 2)
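If df1['brand'] and df1['model'] contain many repeated values, you can improve performance by building one regular expression for the brands and one for the models: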
brands = '({})'.format('|'.join(df1['brand'].dropna().unique()))
# '(CANON|SONY|LG)'
models = '({})'.format('|'.join(df1['model'].dropna().unique()))
# '(FS36dINFS|55UJ635V|KD-65XD8505BAE|49UJ651V)'
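One caveat when joining raw strings into a pattern: a brand or model name may contain characters that are special in regular expressions. The following is a minimal sketch of escaping each keyword with re.escape before joining. The model names 'A+B' and 'X.1' and the product names are made up for illustration and do not come from the question's data.

```python
import re
import pandas as pd

# Hypothetical model names; 'A+B' and 'X.1' contain regex metacharacters.
models = pd.Series(['KD-65XD8505BAE', 'A+B', 'X.1', None, 'A+B'])

# Escape every unique, non-null keyword so it is matched literally.
keywords = [re.escape(m) for m in models.dropna().unique()]
pattern = '({})'.format('|'.join(keywords))

# Hypothetical product names. Without escaping, 'A+B' would also match 'AAB'
# and 'X.1' would also match 'X21'.
names = pd.Series(['Widget A+B deluxe', 'Widget AAB deluxe', 'Gadget X21'])
extracted = names.str.extract(pattern, expand=False)
print(extracted.tolist())  # ['A+B', nan, nan]
```

With escaping, only the first name yields a match; the other two stay NaN instead of producing false positives.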
You can then use the str.extract method to pull the brand and model strings out of df2['name']:
df2['brand'] = df2['name'].str.extract(brands, expand=False)
df2['model'] = df2['name'].str.extract(models, expand=False)
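Note that the extraction above is case-sensitive, so a name such as 'Sony 65" LED Smart TV ...' would not match the brand 'SONY'. If that matters for your data, one possible variation (an assumption on my part, not something the question asks for) is to pass flags=re.IGNORECASE to str.extract and normalise the result to upper case so it still lines up with the upper-case brands in df1:

```python
import re
import pandas as pd

# Brands as they appear in df1['brand'] (upper case).
brands = ['CANON', 'SONY', 'LG']
names = pd.Series(['Sony 65" LED Smart TV KD-65XD8505BAE',
                   'LG 49" 4K UHD LED Smart TV 49UJ651V',
                   'Samsung 32" LED Smart TV UE-32J4505XXE'])

pattern = '({})'.format('|'.join(re.escape(b) for b in brands))

# Match regardless of case, then upper-case the captured text so that it
# can be compared or merged against the upper-case brand column.
found = names.str.extract(pattern, flags=re.IGNORECASE, expand=False).str.upper()
print(found.tolist())  # ['SONY', 'LG', nan]
```

Be aware that short keywords such as 'LG' become more likely to match inside unrelated words once case is ignored, so this is a trade-off rather than a strict improvement.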
You can then get the desired data as a DataFrame by performing an inner merge on the brand and model columns:
bm = ['brand', 'model']
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')
For example, putting it all together:
import re
import sys
import pandas as pd
pd.options.display.width = sys.maxsize
df1 = pd.DataFrame({
    'brand': ['CANON', 'SONY', 'LG', 'SONY', 'LG'],
    'model': ['FS36dINFS', None, '55UJ635V', 'KD-65XD8505BAE', '49UJ651V'],
    'name': ['CANON CAMERA 20 FS36dINFS MEGAPIXEL', 'SONY HD CAMERA 25 MEGAPIXEL', 'LG 55" 4K UHD LED Smart TV 55UJ635V', 'Sony 65" LED Smart TV KD-65XD8505BAE', 'LG 49" 4K UHD LED Smart TV 49UJ651V'],
    'price': [9900.0, 8900.0, 5890.0, 4790.0, 4390.0]})
df2 = pd.DataFrame({
    'name': ['LG 49" 4K UHD LED Smart TV 49UJ651V', 'SONY HD CAMERA 25 MEGAPIXEL', 'Samsung 32" LED Smart TV UE-32J4505XXE'],
    'price': [4790.0, 12.9, 1.3],
    'store': ['storeA', 'storeA', 'storeB']})
bm = ['brand', 'model']
for col in bm:
    # Escape regex metacharacters so each keyword is matched literally.
    keywords = [re.escape(item) for item in df1[col].dropna().unique()]
    pat = '({})'.format('|'.join(keywords))
    # Add the first brand/model found in each df2 name as a new column.
    df2[col] = df2['name'].str.extract(pat, expand=False)
# Rows with a missing brand or model cannot match, so drop them before merging.
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')
print(result)
This yields:
brand model name_x price_x name_y price_y store
0 LG 49UJ651V LG 49" 4K UHD LED Smart TV 49UJ651V 4390.0 LG 49" 4K UHD LED Smart TV 49UJ651V 4790.0 storeA
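Before running this on the full data, it may be worth checking on a small sample that the merge finds the same pairs as the original nested loop. The sketch below does that for the sample data; the helper names naive_pairs and merged_pairs are hypothetical, and the loop skips df1 rows with a missing brand or model, which is an assumption about how the question's code is meant to treat them.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'brand': ['CANON', 'SONY', 'LG', 'SONY', 'LG'],
    'model': ['FS36dINFS', None, '55UJ635V', 'KD-65XD8505BAE', '49UJ651V'],
    'name': ['CANON CAMERA 20 FS36dINFS MEGAPIXEL', 'SONY HD CAMERA 25 MEGAPIXEL',
             'LG 55" 4K UHD LED Smart TV 55UJ635V', 'Sony 65" LED Smart TV KD-65XD8505BAE',
             'LG 49" 4K UHD LED Smart TV 49UJ651V'],
    'price': [9900.0, 8900.0, 5890.0, 4790.0, 4390.0]})
df2 = pd.DataFrame({
    'name': ['LG 49" 4K UHD LED Smart TV 49UJ651V', 'SONY HD CAMERA 25 MEGAPIXEL',
             'Samsung 32" LED Smart TV UE-32J4505XXE'],
    'price': [4790.0, 12.9, 1.3],
    'store': ['storeA', 'storeA', 'storeB']})
keys = ['brand', 'model']

def naive_pairs(df1, df2):
    """Nested-loop matching; returns a set of (df1 name, df2 name) pairs."""
    pairs = set()
    for _, r1 in df1.dropna(subset=keys).iterrows():
        for _, r2 in df2.iterrows():
            if r1['brand'] in r2['name'] and r1['model'] in r2['name']:
                pairs.add((r1['name'], r2['name']))
    return pairs

def merged_pairs(df1, df2):
    """Regex extraction plus inner merge; returns the same kind of set."""
    df2 = df2.copy()  # avoid adding columns to the caller's frame
    for col in keys:
        kws = [re.escape(k) for k in df1[col].dropna().unique()]
        df2[col] = df2['name'].str.extract('({})'.format('|'.join(kws)), expand=False)
    merged = pd.merge(df1.dropna(subset=keys), df2.dropna(subset=keys), on=keys, how='inner')
    return set(zip(merged['name_x'], merged['name_y']))

print(naive_pairs(df1, df2) == merged_pairs(df1, df2))  # True
```

The two approaches agree here, but they are not equivalent in general: str.extract returns only the first match in each name, so a df2 name that contains several of the brands or models from df1 would produce fewer pairs with the merge than with the loop.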