Question

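I have a 10M-row dataset and a 20k-row dataset, and I run two pandas apply functions on them. Even on a 1,000-row subset, the first apply function takes several minutes to run. I want to debug the second apply function without re-running the whole module and waiting several minutes each time. I have read the documentation, but I am still confused about how to do this.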

The functions are as follows:

First function

def match_name(row, pubdf):
    # pdb.set_trace()
    if row is not None and row != 'nan':
        minscore = 90
        choice, score = fwp.extractOne(row, pubdf)
        return choice if score > minscore else None
    else:
        return None

Second function

def match_address(privrow):
    pdb.set_trace()
    minscore = 90
    if privrow.supplier_streetadd:
        if privrow.supplier_streetadd != 'nan':
            pub_addresses = [privrow.pub_streetadd1,
                             privrow.pub_streetadd2,
                             privrow.pub_streetadd3]
            choice, score = fwp.extractOne(privrow.supplier_streetadd, pub_addresses)
            return choice if score > minscore else None
    else:
        return None

The code that runs these functions:

tqdm.pandas()

priv_df['pub_org_name'] = priv_df['supplier_name'].progress_apply(match_name, args=(list(pub_df['org_name']),))

priv_df['pub_streetaddress'] = priv_df.progress_apply(match_address, axis=1)
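
A minimal sketch of one way to avoid repeating the slow first step while debugging the second: store the first result on disk and reload it on later runs. The helper name `cached` and the file name `pub_org_name.pkl` are hypothetical and not part of the code above. The sketch assumes the result can be pickled and that the input data does not change between runs; if it does, the cache file has to be deleted by hand.

```python
import pickle
from pathlib import Path


def cached(path, compute):
    """Return the pickled result stored at `path` if the file exists.

    Otherwise call `compute()`, save its result to `path`, and return it.
    `cached` is a hypothetical helper, not a pandas or pdb function.
    """
    path = Path(path)
    if path.exists():
        # A previous run already did the slow work: just load it.
        with path.open("rb") as fh:
            return pickle.load(fh)
    # First run: do the slow work once and keep the result on disk.
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


# Possible use with the code above (file name is an assumption):
#
# priv_df['pub_org_name'] = cached(
#     'pub_org_name.pkl',
#     lambda: priv_df['supplier_name'].progress_apply(
#         match_name, args=(list(pub_df['org_name']),)),
# )
# priv_df['pub_streetaddress'] = priv_df.progress_apply(match_address, axis=1)
```

With this in place, only the first run pays for `match_name`. Later runs load the saved column and reach the `pdb.set_trace()` call in `match_address` almost immediately.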

PDB Python debugging - restart from a specific function without re-running the whole module

0 answers: