I have a file with roughly 500K records. Each record needs to be validated. The records are de-duplicated and stored in a list:
with open(filename) as f:
    records = f.readlines()
The validation data I use is stored in a Pandas DataFrame. This DataFrame contains about 80K records and 9 columns (myfile.csv).
import pandas as pd
filename = 'myfile.csv'
df = pd.read_csv(filename)
def check(df, destination):
    try:
        # Convert each part to int so it can be compared with the integer columns;
        # strip() removes the trailing newline that readlines() leaves on each record.
        destination = destination.strip()
        area_code = int(destination[:3])
        office_code = int(destination[3:6])
        subscriber_number = int(destination[6:])
        if any(df['AREA_CODE'].astype(int) == area_code):
            area_code_numbers = df[df['AREA_CODE'].astype(int) == area_code]
            if any(area_code_numbers['OFFICE_CODE'].astype(int) == office_code):
                matching_records = area_code_numbers[area_code_numbers['OFFICE_CODE'].astype(int) == office_code]
                # Assumes SUBSCRIBER_START and SUBSCRIBER_END are numeric columns
                start = subscriber_number >= matching_records['SUBSCRIBER_START']
                end = subscriber_number <= matching_records['SUBSCRIBER_END']
                # Perform intersection
                labels = matching_records[start & end]['LABEL']
                # We should return only 1 value
                if len(labels) > 0:
                    return labels.to_string(index=False)
                else:
                    return 'INVALID_SUBSCRIBER'
            else:
                return 'INVALID_OFFICE_CODE'
        else:
            return 'INVALID_AREA_CODE'
    except KeyError:
        # A missing column is silently ignored; the function returns None
        pass
    except Exception:
        # Any other error (e.g. a non-numeric record) is also ignored
        pass
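I am looking for a way to improve the comparison, because when I run it on the full file it just hangs. If I run it with a small subset (10K), it works fine. I am not sure whether there is a more efficient notation or a recommended approach.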
for record in records:
    check(df, record)
Using macOS, 8GB RAM / 2.3 GHz Intel Core i7.
Running cProfile.run on a single call to the check function shows:
4253 function calls (4199 primitive calls) in 0.017 seconds.
So I assume 500K records would take about 2 1/2 hours.
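As a quick back-of-the-envelope check of that estimate, here is a minimal sketch that assumes the 0.017 s measured for one call holds for every record (the variable names are illustrative only):

```python
# Extrapolate the profiled per-call time to the whole file.
seconds_per_call = 0.017   # taken from the cProfile output above
n_records = 500_000        # approximate number of records

total_seconds = seconds_per_call * n_records
total_hours = total_seconds / 3600
print(f"{total_seconds:.0f} s = {total_hours:.2f} h")  # prints: 8500 s = 2.36 h
```

That is a little under the 2 1/2 hours quoted, and it ignores any per-record variation.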
Answer 0 (score: 2)
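No data is available to test with, but consider this untested approach: build a second DataFrame from the records, left-join the two DataFrames, and then run the validation steps. This avoids any loops and applies the conditional logic across whole columns: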
import pandas as pd
import numpy as np
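with open('RecordsValidate.txt') as f:
    # strip() drops the trailing newline that readlines() keeps on each line
    records = [line.strip() for line in f.readlines()]
print(records[:5])  # preview only; printing all ~500K records would flood the console
rdf = pd.DataFrame({'rcd_id': list(range(1, len(records) + 1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    # int so the value can be compared with SUBSCRIBER_START/END (assumed numeric)
                    'rcd_subscriber_number': [int(rcd[6:]) for rcd in records]})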
filename = 'myfile.csv'
df = pd.read_csv(filename)
# VALIDATE AREA CODE
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
# Test the column itself: pd.isnull('rcd_id') checks a literal string and is always False
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_AREA_CODE', None)
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)
# VALIDATE OFFICE CODE
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])
# VALIDATE SUBSCRIBER
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),  # == compares; a single = is a syntax error
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)
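For comparison, here is a small self-contained sketch of the same merge idea run from the records side, so that every record keeps exactly one row and receives one result. This is a variant, not the code above: the sample data, the `known_area`, `known_office`, `cand` and `labels` names, and the choice to take the first matching range are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of myfile.csv (only the columns the question uses).
df = pd.DataFrame({
    'AREA_CODE': [212, 212, 415],
    'OFFICE_CODE': [555, 555, 777],
    'SUBSCRIBER_START': [0, 5000, 1000],
    'SUBSCRIBER_END': [4999, 9999, 1999],
    'LABEL': ['NYC_A', 'NYC_B', 'SF_A'],
})

# Hypothetical records, shaped like the output of readlines().
records = ['2125551234\n', '2125557000\n', '2129991234\n',
           '3055551234\n', '4157770500\n']

# Split every record into its three integer parts in one vectorized pass.
digits = pd.Series(records).str.strip()
rdf = pd.DataFrame({
    'rcd_id': np.arange(len(digits)),
    'AREA_CODE': digits.str[:3].astype(int),
    'OFFICE_CODE': digits.str[3:6].astype(int),
    'subscriber': digits.str[6:].astype(int),
})

# Does the area code, and the (area, office) pair, exist in the reference data?
known_area = rdf['AREA_CODE'].isin(df['AREA_CODE']).to_numpy()
pairs = df[['AREA_CODE', 'OFFICE_CODE']].drop_duplicates()
known_office = rdf.merge(pairs, on=['AREA_CODE', 'OFFICE_CODE'], how='left',
                         indicator=True)['_merge'].eq('both').to_numpy()

# Pair each record with every range for its (area, office), keep ranges that contain it.
cand = rdf.merge(df, on=['AREA_CODE', 'OFFICE_CODE'], how='inner')
in_range = ((cand['subscriber'] >= cand['SUBSCRIBER_START']) &
            (cand['subscriber'] <= cand['SUBSCRIBER_END']))
labels = cand[in_range].drop_duplicates('rcd_id').set_index('rcd_id')['LABEL']

# Most specific failure last, so it overrides the more general ones.
rdf['RETURN'] = rdf['rcd_id'].map(labels).fillna('INVALID_SUBSCRIBER')
rdf.loc[~known_office, 'RETURN'] = 'INVALID_OFFICE_CODE'
rdf.loc[~known_area, 'RETURN'] = 'INVALID_AREA_CODE'
print(rdf[['rcd_id', 'RETURN']])
```

The inner merge creates one row per record and candidate range, so memory use grows with the number of ranges per (area, office) pair; on the real 500K by 80K data it may be worth processing the records in chunks.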