Question

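A DataFrame column contains substrings of company names in various formats, and these substrings need to be mapped to a fixed representation of the company name. The formats are recorded in sotest.json: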

{
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"]
}

This JSON is read into a DataFrame as follows:

import json
import pandas as pd

with open('sotest.json') as tf:
    testdata = json.load(tf)
indexlist = []
itemslist = []
for k, v in testdata.items():
    indexlist.append(k)
    itemslist.append(v)
sojsondf = pd.DataFrame({'AssortedNames': itemslist}, index = indexlist)

Here is a test DataFrame:

namesdf = pd.DataFrame(data = ["A&F Ltd", "A & F CO", "A& F COMPANY", "ABERCOMBIE & FITCH LIMITED", 
                               "COKE M/S", "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"], 
                      columns = ['RecordedCompanyName'])

The following function is applied to the column of the DataFrame above to get the standardized output:

def sorowchecker(inputstring, sojsondf):
    # Return the index label of the first row that has an alias contained in inputstring.
    for i, row in sojsondf.iterrows():
        if any(sponsor in inputstring for sponsor in row['AssortedNames']):
            return i
    # No alias matched in any row.
    return "DIRECTMARKETING"

Using the function above:

namesdf['Company'] = namesdf['RecordedCompanyName'].apply(sorowchecker, args=(sojsondf, ))

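The actual namesdf.shape[0] is ~60k and the actual sojsondf.shape[0] is ~50, so the program takes quite a long time to run. Does anyone have suggestions for making sorowchecker() run faster and/or any other improvements (extra kudos for anything that uses concurrency)? Thanks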

Answer 1

I precompile the regular expressions using BaseClass, then use them in __init__ to substitute the "canonical" name, and use testdata to get only the replaced part.

After that, every row that does not contain replace in the list is replaced with map.

Could you check whether this works for you?

'Company'

Output:

'DIRECTMARKETING
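
A minimal sketch of the precompiled-regex idea this answer describes, under stated assumptions: the `compiled` dict and the `canonical_name` helper are hypothetical names introduced for illustration and are not necessarily what the answerer wrote. It keeps the semantics of sorowchecker() (the first canonical name, in dict order, with an alias contained in the string wins), but compiles one pattern per company once instead of looping over DataFrame rows on every call.

```python
import re
import pandas as pd

# Same mapping as sotest.json, inlined so the sketch is self-contained.
testdata = {
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"],
}

# Hypothetical structure: one precompiled pattern per canonical name.
# re.escape keeps characters in the aliases from being read as regex syntax.
compiled = {
    canonical: re.compile("|".join(re.escape(alias) for alias in aliases))
    for canonical, aliases in testdata.items()
}

def canonical_name(recorded, default="DIRECTMARKETING"):
    """Return the first canonical name whose pattern occurs in `recorded`."""
    for canonical, pattern in compiled.items():
        if pattern.search(recorded):
            return canonical
    return default

namesdf = pd.DataFrame(
    {"RecordedCompanyName": ["A&F Ltd", "A & F CO", "A& F COMPANY",
                             "ABERCOMBIE & FITCH LIMITED", "COKE M/S",
                             "COCA-COLA COMPANY", "COCACOLA BOTTLING CO", "SONY"]}
)
namesdf["Company"] = namesdf["RecordedCompanyName"].map(canonical_name)
```

This still calls a Python function once per row, so it is likely to help mostly by avoiding iterrows(); whether it is fast enough for ~60k rows would need to be measured.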

Answer 2

IIUC, you don't need to create a new DataFrame; you can just use the dict to build a reverse dict and map:

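import json
import re

with open('sotest.json') as tf:
    testdata = json.load(tf)

# reverse lookup: each alias -> its canonical company name
backward = {x:k for k,v in testdata.items() for x in v}

# pattern to check if any alias occurs in the names
# (re.escape keeps characters in the aliases from being read as regex syntax)
pattern = '|'.join(re.escape(x) for x in backward.keys())

# output:
(namesdf['RecordedCompanyName']
 .str.extract(f'({pattern})')[0]   # extract the first matching alias
 .map(backward)                    # convert the matched alias to the canonical name
 .fillna('DIRECTMARKETING')        # replace non-matching rows with the default
)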

Output:

0    ABERCOMBIEFITCH
1    ABERCOMBIEFITCH
2    ABERCOMBIEFITCH
3    ABERCOMBIEFITCH
4           COCACOLA
5           COCACOLA
6           COCACOLA
7    DIRECTMARKETING
Name: 0, dtype: object
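
One caveat with the alternation approach: when two alternatives can match at the same position, the regex engine takes the one listed first, so a short alias can shadow a longer one. The sketch below sorts the aliases longest first to avoid that. The "COKEZERO" entry and the two-name sample input are hypothetical, added only to illustrate the overlap; they are not part of the original sotest.json.

```python
import re
import pandas as pd

testdata = {
    "ABERCOMBIEFITCH": ["A&F", "A & F", "A& F", "ABERCOMBIE & FITCH"],
    "COCACOLA": ["COKE", "COCA-COLA", "COCACOLA"],
    "COKEZERO": ["COKE ZERO"],  # hypothetical entry, only to show overlapping aliases
}

# Reverse lookup: alias -> canonical name.
backward = {alias: name for name, aliases in testdata.items() for alias in aliases}

# Longest aliases first, so "COKE ZERO" is tried before "COKE".
aliases = sorted(backward, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(alias) for alias in aliases) + ")"

names = pd.Series(["A&F Ltd", "COKE ZERO CO", "COKE M/S", "SONY"])
company = (
    names.str.extract(pattern, expand=False)  # first matching alias, NaN if none
    .map(backward)                            # alias -> canonical name
    .fillna("DIRECTMARKETING")                # default for rows with no match
)
```

Without the sort, "COKE ZERO CO" would be matched by "COKE" first (it appears earlier in the dict) and mapped to COCACOLA.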

Pandas: mapping various similar substrings to a single standard format
