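A DataFrame column contains company names as substrings written in various formats, and these need to be mapped to a fixed representation of the company name. The variants are recorded in sotest.json: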
{
"ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
"COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"]
}
This JSON is read into a df as follows:
import json
import pandas as pd

with open('sotest.json') as tf:
    testdata = json.load(tf)
indexlist = []
itemslist = []
for k, v in testdata.items():
    indexlist.append(k)
    itemslist.append(v)
sojsondf = pd.DataFrame({'AssortedNames': itemslist}, index=indexlist)
Below is a test df:
namesdf = pd.DataFrame(data = ["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED",
"COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"],
columns = ['RecordedCompanyName'])
The following function is applied to the df column above to get the standardized output:
def sorowchecker(inputstring, sojsondf):
    match = False
    for i, row in sojsondf.iterrows():
        if any(sponsor in inputstring for sponsor in row['AssortedNames']):
            match = True
        if match == True:
            break
    return i if match == True else "DIRECTMARKETING"
Using the function above:
namesdf['Company'] = namesdf['RecordedCompanyName'].apply(sorowchecker, args=(sojsondf, ))
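The actual namesdf.shape[0] is ~60k and the actual sojsondf.shape[0] is ~50, which means the program takes a considerable amount of time. Does anyone have suggestions on how to make sorowchecker() run faster, and/or any other improvements (anything using concurrency would be especially appreciated)? Thanks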
Answer 0 (score: 1)
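I precompiled regular expressions from testdata, then used them in replace to substitute the "canonical" name, and used map to keep only the replaced part.
After that, every row of 'Company' that is not in the list is replaced with 'DIRECTMARKETING'.
Can you check whether this works for you?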
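A minimal sketch of the idea described above, not necessarily the answerer's exact code: one regular expression is precompiled per canonical name, and each recorded name is checked against them in turn. The names `compiled`, `canonicalize`, `names`, and `companies` are hypothetical, and the sketch runs on a plain list so it needs only the standard library.

```python
import re

# Same mapping as sotest.json in the question
testdata = {
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"],
}

# One precompiled pattern per canonical name; each variant is escaped so it
# is matched literally rather than interpreted as regex syntax.
compiled = {
    canonical: re.compile("|".join(re.escape(v) for v in variants))
    for canonical, variants in testdata.items()
}

def canonicalize(name, default="DIRECTMARKETING"):
    """Return the first canonical name whose pattern occurs in `name`."""
    for canonical, rx in compiled.items():
        if rx.search(name):
            return canonical
    return default

names = ["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED",
         "COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"]
companies = [canonicalize(n) for n in names]
```

With the question's dataframe, the same function could be applied as `namesdf['Company'] = namesdf['RecordedCompanyName'].map(canonicalize)`. The patterns are compiled once up front instead of being rebuilt for every row.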
Answer 1 (score: 1)
IIUC, you don't need to create a new dataframe; just use the dict to build a reverse dict and use map:
import json

with open('sotest.json') as tf:
    testdata = json.load(tf)
# reverse dict: each variant spelling -> canonical name
backward = {x:k for k,v in testdata.items() for x in v}
# pattern that matches any known variant inside a name
pattern = '|'.join(backward.keys())
# output:
(namesdf['RecordedCompanyName']
.str.extract(f'({pattern})')[0] # extract the first matching variant
.map(backward) # convert the matched variant to the canonical name
.fillna('DIRECTMARKETING') # replace non-matches with the default
)
Output:
0 ABERCOMBIEFITCH
1 ABERCOMBIEFITCH
2 ABERCOMBIEFITCH
3 ABERCOMBIEFITCH
4 COCACOLA
5 COCACOLA
6 COCACOLA
7 DIRECTMARKETING
Name: 0, dtype: object
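One caveat with joining the keys directly: a variant containing regex metacharacters (such as `.` or parentheses) would be interpreted as regex syntax rather than literal text. The sketch below, which is an assumption-laden illustration rather than part of the answer, escapes each variant with `re.escape` and tries longer variants first. The `EXAMPLECORP` entry and the names `matcher`, `lookup`, `names`, and `results` are hypothetical and exist only to show the effect.

```python
import re

testdata = {
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"],
    # Hypothetical entry with regex metacharacters, added only for illustration
    "EXAMPLECORP": ["EX.CORP (INTL)"],
}

# Reverse lookup: variant spelling -> canonical name
backward = {variant: canonical
            for canonical, variants in testdata.items()
            for variant in variants}

# Escape every variant so it matches literally; when two variants could match
# at the same position, the longer one is listed (and therefore tried) first.
variants_longest_first = sorted(backward, key=len, reverse=True)
pattern = "|".join(re.escape(v) for v in variants_longest_first)
matcher = re.compile(pattern)

def lookup(name, default="DIRECTMARKETING"):
    """Map the first variant found in `name` to its canonical name."""
    m = matcher.search(name)
    return backward[m.group(0)] if m else default

names = ["A&F Ltd", "COCA-COLA COMPANY", "EX.CORP (INTL) PLC",
         "EXXCORP (INTL)", "SONY"]
results = [lookup(n) for n in names]
```

The escaped `pattern` string can be passed to `.str.extract(f'({pattern})')` in place of the unescaped one, leaving the rest of the chain (`.map(backward)` and `.fillna(...)`) unchanged.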