Pandas optimization for multiple records

Time: 2016-09-07 16:39:34

Tags: python pandas

I have a file with roughly 500K records, and every record needs to be validated. The records are read in and stored in a list:

with open(filename) as f:
    records = f.readlines()

The validation data I use is stored in a Pandas DataFrame; it contains roughly 80K records and 9 columns (myfile.csv).

import pandas as pd

filename = 'myfile.csv'
df = pd.read_csv(filename)
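
The check function below only relies on five of those nine columns, so a quick sanity check on the DataFrame could look like this:

expected_cols = {'AREA_CODE', 'OFFICE_CODE', 'SUBSCRIBER_START', 'SUBSCRIBER_END', 'LABEL'}
assert expected_cols.issubset(df.columns), 'myfile.csv is missing a required column'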

def check(df, destination):
    try:
        # Split the destination number into area code, office code and subscriber number
        area_code = int(destination[:3])
        office_code = int(destination[3:6])
        subscriber_number = destination[6:]

        if any(df['AREA_CODE'].astype(int) == area_code):
            area_code_numbers = df[df['AREA_CODE'].astype(int) == area_code]
            if any(area_code_numbers['OFFICE_CODE'].astype(int) == office_code):
                matching_records = area_code_numbers[area_code_numbers['OFFICE_CODE'].astype(int) == office_code]

                start = subscriber_number >= matching_records['SUBSCRIBER_START']
                end = subscriber_number <= matching_records['SUBSCRIBER_END']
                # Perform intersection
                record_found = matching_records[start & end]['LABEL'].to_string(index=False)
                # We should return only 1 value
                if len(record_found) > 0:
                    return record_found
                else:
                    return 'INVALID_SUBSCRIBER'
            else:                   
                return 'INVALID_OFFICE_CODE'
        else:               
            return 'INVALID_AREA_CODE'
    except KeyError:
        pass
    except Exception:
        pass

I'm looking for a way to speed up the comparison, because when I run it over the full file it just hangs. If I run it with a small subset (10K records) it works fine. I'm not sure whether there is a more efficient idiom/approach for this.

for record in records:
    check(df, record)
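
The loop above discards the return value and passes each raw line, trailing newline included, into check. Something along these lines (a sketch) collects the results and strips the newline that readlines() leaves on every record:

results = []
for record in records:
    results.append(check(df, record.strip()))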

Running on MacOS with 8 GB RAM / 2.3 GHz Intel Core i7.

Running cProfile.run on the check function alone shows:

4253 function calls (4199 primitive calls) in 0.017 seconds.

So I estimate that processing all 500K records would take roughly 2 1/2 hours.
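
(That is just the profile extrapolated: 0.017 s per call × 500,000 records ≈ 8,500 s, or about 2.4 hours.)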

1 Answer:

Answer 0: (score: 2)

Even without data to test against, consider this untested approach: merge the two pieces of data with a left join, then run the validation steps on the merged result. That avoids any looping and per-row conditional logic:

import pandas as pd
import numpy as np

with open('RecordsValidate.txt') as f:
    records = f.readlines()
    print(records)

rdf = pd.DataFrame({'rcd_id': list(range(1,len(records)+1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    'rcd_subscriber_number': [rcd[6:].strip() for rcd in records]})  # strip trailing newline from readlines()

filename = 'myfile.csv'
df = pd.read_csv(filename)

# VALIDATE AREA CODE
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_AREA_CODE', np.nan)

mrgdf.drop([c for c in rdf.columns], inplace=True,axis=1)

# VALIDATE OFFICE CODE                         
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])

# VALIDATE SUBSCRIBER
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop([c for c in rdf.columns], inplace=True,axis=1)
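
A quick way to see how the validation came out is to tally the RETURN column, for example:

print(mrgdf['RETURN'].value_counts(dropna=False))

That shows how many rows landed in each validation outcome.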