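Pandas: map a variety of similar substrings to a single standard format

Time: 2019-07-15 15:00:49

Tags: python pandas

A DataFrame column contains company names with substrings in various formats, and these substrings need to be mapped to a fixed representation of the company name. The various formats are recorded in sotest.json: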

{
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"]
}

This JSON is read into a DataFrame as follows:

import json
import pandas as pd

with open('sotest.json') as tf:
    testdata = json.load(tf)
# canonical names become the index; each row holds the list of known variants
indexlist = []
itemslist = []
for k, v in testdata.items():
    indexlist.append(k)
    itemslist.append(v)
sojsondf = pd.DataFrame({'AssortedNames': itemslist}, index=indexlist)

Below is a test DataFrame:

namesdf = pd.DataFrame(data = ["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED", 
                               "COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"], 
                      columns = ['RecordedCompanyName'])

The following function is applied to the column of the DataFrame above to get the standardized output:

def sorowchecker(inputstring, sojsondf):
    # return the index label (canonical name) of the first row that has
    # any variant occurring as a substring of inputstring
    for i, row in sojsondf.iterrows():
        if any(sponsor in inputstring for sponsor in row['AssortedNames']):
            return i
    # no variant matched
    return "DIRECTMARKETING"

Using the function above:

namesdf['Company'] = namesdf['RecordedCompanyName'].apply(sorowchecker, args=(sojsondf, ))

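The actual namesdf.shape[0] is ~60k and the actual sojsondf.shape[0] is ~50, which means the program takes quite a long time. Does anyone have suggestions for making sorowchecker() run faster and/or other improvements (extra kudos for anything that uses concurrency)? Thanks

2 answers:

Answer 0: (score: 1)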

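I precompile the regular expressions using BaseClass, then use them in __init__ to replace each variant with its "canonical" name, and use testdata to keep only the replaced part.

After that, every row in the list that does not contain replace is replaced with map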
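The steps described above can be sketched roughly as follows. This is only a minimal illustration of the precompiled-regex idea, not the answerer's original code. It assumes the contents of sotest.json (inlined here as `testdata`); the `compiled` list and the `canonicalize` helper are hypothetical names introduced for the example.

```python
import re
import pandas as pd

# Inlined copy of sotest.json so the sketch is self-contained.
testdata = {
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"],
}

# Precompile one regex per canonical name (hypothetical structure).
# re.escape keeps characters in the variants from being read as regex syntax.
compiled = [
    (canonical, re.compile("|".join(re.escape(v) for v in variants)))
    for canonical, variants in testdata.items()
]

def canonicalize(name, default="DIRECTMARKETING"):
    """Return the canonical name of the first pattern found in `name`."""
    for canonical, regex in compiled:
        if regex.search(name):
            return canonical
    return default  # no variant matched

namesdf = pd.DataFrame(
    ["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED",
     "COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"],
    columns=["RecordedCompanyName"],
)
namesdf["Company"] = namesdf["RecordedCompanyName"].map(canonicalize)
```

Compiling the patterns once up front avoids rebuilding them for each of the ~60k rows; each row then costs at most one regex search per canonical name.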

Can you check whether this works for you?

'Company'

Output:

'DIRECTMARKETING

Answer 1: (score: 1)

IIUC, you don't need to create a new DataFrame; just use the dict to build a reverse dict and then map

import json
import re

with open('sotest.json') as tf:
    testdata = json.load(tf)

# reverse lookup: each variant -> its canonical name
backward = {x: k for k, v in testdata.items() for x in v}

# pattern that matches any variant; escape regex metacharacters in the variants
pattern = '|'.join(re.escape(x) for x in backward)

# output:
(namesdf['RecordedCompanyName']
 .str.extract(f'({pattern})')[0]   # extract the first matching variant
 .map(backward)                    # convert the variant to its canonical name
 .fillna('DIRECTMARKETING')        # replace non-matches with the default
)

Output:

0    ABERCOMBIEFITCH
1    ABERCOMBIEFITCH
2    ABERCOMBIEFITCH
3    ABERCOMBIEFITCH
4           COCACOLA
5           COCACOLA
6           COCACOLA
7    DIRECTMARKETING
Name: 0, dtype: object
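
One thing to watch with an alternation pattern like this: the regex engine tries the alternatives left to right, so when one variant is a prefix of another, the shorter one can win. A minimal sketch of a reusable version follows, assuming the same data as in the question. The `standardize` helper and its parameter names are hypothetical, and sorting longest-first is an addition that is not part of the answer above.

```python
import re
import pandas as pd

def standardize(names, variants_by_company, default="DIRECTMARKETING"):
    """Map each string in `names` to a canonical company name (hypothetical helper)."""
    # Reverse lookup: variant -> canonical name.
    backward = {v: k for k, vs in variants_by_company.items() for v in vs}
    # Longest variants first, so a longer variant is preferred over a shorter
    # one starting at the same position; escape regex metacharacters.
    ordered = sorted(backward, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(v) for v in ordered) + ")"
    # extract() yields NaN for rows without a match; fillna() supplies the default.
    return names.str.extract(pattern)[0].map(backward).fillna(default)

# Inlined copy of sotest.json so the sketch is self-contained.
testdata = {
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"],
}
namesdf = pd.DataFrame(
    ["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED",
     "COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"],
    columns=["RecordedCompanyName"],
)
namesdf["Company"] = standardize(namesdf["RecordedCompanyName"], testdata)
```

Because the whole column goes through a single vectorized `str.extract` call, there is no Python-level loop over the ~60k rows.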