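My Python script exits with an out-of-memory (OOM) error when processing large files. (It works for a small set of records, ~10K.)
I am working with 2 files:
In file 1, I have a field called uuid.
I need to compare:
If company_name is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor_companies dataframe.
If the website is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor_companies dataframe.
When I process the files on the server (about 30 GB of RAM), the script gets stuck at this point: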
logging.info('Matching TLD.')
match_tld = competitor_companies.tld.isin(companies.tld)
Then the script stops, and I see the following line in /var/log/syslog:
Out of memory: Kill process 177106 (company_generat) score 923 or sacrifice child
Python code:
import logging

import pandas

# _IsEmpty and _NormalizeText are helpers defined elsewhere in the script.


def MatchCompanies(
        companies: pandas.DataFrame,
        competitor_companies: pandas.DataFrame) -> pandas.DataFrame:
    """Find Competitor companies in companies dataframe and add their UUID.

    Args:
        companies: A dataframe with company information from CSV file.
        competitor_companies: A dataframe with Competitor information from CSV file.

    Returns:
        The competitor dataframe with a uuid column for matched companies.

    Raises:
        ValueError: No companies found.
    """
    if _IsEmpty(companies):
        raise ValueError('No companies found')
    # Clean up empty fields.
    companies = companies.fillna('')
    logging.info('Found: %d records.', len(competitor_companies))
    competitor_companies = competitor_companies.fillna('')
    # Create a column to define if we found a match or not.
    competitor_companies['match'] = False
    # Add Top Level Domain (tld) column to compare matching companies.
    companies.rename(columns={'website': 'tld'}, inplace=True)
    logging.info('Cleaning up company name.')
    companies.company_name = companies.company_name.apply(_NormalizeText)
    competitor_companies.company_name = competitor_companies.company_name.apply(
        _NormalizeText)
    # Create a new column since AppAnnie already contains TLD in company_url.
    competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)
    logging.info('Matching TLD.')
    match_tld = competitor_companies.tld.isin(companies.tld)
    logging.info('Matching Company Name.')
    match_company_name = competitor_companies.company_name.isin(
        companies.company_name)
    # Updates match column if TLD or company_name or similar companies matches.
    competitor_companies['match'] = match_tld | match_company_name
    # Extracts UUID for TLD matches.
    logging.info('Extracting UUID')
    merge_tld = competitor_companies.merge(
        companies[['tld', 'uuid']], on='tld', how='left')
    # Extracts UUID for company name matches.
    merge_company_name = competitor_companies.merge(
        companies[['company_name', 'uuid']], on='company_name', how='left')
    # Combines dataframes.
    competitor_companies['uuid'] = merge_tld['uuid'].combine_first(
        merge_company_name['uuid'])
    match_companies = len(competitor_companies[competitor_companies['match']])
    total_companies = len(competitor_companies)
    logging.info('Results found: %d out of %d', match_companies, total_companies)
    competitor_companies.drop('match', axis=1, inplace=True)
    competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
    return competitor_companies
This is how I read the files:
import pandas


def LoadDataSet(filename: str) -> pandas.DataFrame:
    """Reads CSV file where company information is stored.

    Header information exists in CSV file.

    Args:
        filename: Source CSV file. Header is present in file.

    Returns:
        A pandas dataframe with company information.

    Raises:
        FileError: Unable to read filename.
    """
    # Let pandas read the file directly. Reading it into a str first and
    # wrapping it in io.BytesIO raises TypeError and holds an extra copy.
    dataframe = pandas.read_csv(
        filename, header=0, low_memory=False, memory_map=True)
    return dataframe.where(pandas.notnull(dataframe), None)
I am looking for suggestions on how to improve my code.
Output of the top command while the script is running:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
190875 myuser 20 0 4000944 2.5g 107532 R 100.7 8.5 5:01.93 company_generat
Answer 0 (score: 0)
Why don't you just use pd.merge?
You can create two dataframes, one for company_name matching and a second for website matching, and then left-merge competitor_companies with each of them.
# Create 2 matching tables
c_website = companies[['uuid', 'website']].rename(columns={'uuid': 'uuid_from_website'})
c_name = companies[['uuid', 'company_name']].rename(columns={'uuid': 'uuid_from_name'})
# Merge on each of these tables
# (assumes competitor_companies has a 'website' column; rename company_url first if not)
result = competitor_companies\
    .merge(c_website, how='left', on='website')\
    .merge(c_name, how='left', on='company_name')
Then you need to reconcile the two values, for example giving priority to uuid_from_name:
import numpy as np

result['uuid'] = np.where(result.uuid_from_name.notnull(), result.uuid_from_name, result.uuid_from_website)
del result['uuid_from_name']
del result['uuid_from_website']
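One thing to watch with this approach: a left merge emits one output row per matching key pair, so duplicate keys in companies (including many identical empty strings left behind by fillna('')) can multiply the row count and the memory needed. The following is a minimal sketch on made-up data, not the asker's real files; the helper name attach_uuid and the toy values are hypothetical, and it assumes both frames share the column names website and company_name.

```python
import pandas as pd


def attach_uuid(companies, competitors):
    """Left-merge competitors against deduplicated website and name keys."""

    def lookup(key, uuid_col):
        table = companies[["uuid", key]].rename(columns={"uuid": uuid_col})
        # Drop missing or empty keys, then keep one row per key so the
        # left merge cannot multiply rows.
        table = table[table[key].notna() & (table[key] != "")]
        return table.drop_duplicates(subset=key)

    result = (
        competitors
        .merge(lookup("website", "uuid_from_website"), how="left", on="website")
        .merge(lookup("company_name", "uuid_from_name"), how="left", on="company_name")
    )
    # Prefer the name match and fall back to the website match.
    result["uuid"] = result["uuid_from_name"].combine_first(
        result["uuid_from_website"])
    return result.drop(columns=["uuid_from_name", "uuid_from_website"])


# Toy data (hypothetical values).
companies = pd.DataFrame({
    "uuid": ["u1", "u2", "u3", "u4"],
    "company_name": ["acme", "globex", "initech", ""],
    "website": ["acme.com", "", "initech.com", ""],
})
competitors = pd.DataFrame({
    "company_name": ["acme", "other", "", "globex"],
    "website": ["x.com", "initech.com", "", "g.com"],
})

result = attach_uuid(companies, competitors)
print(result)
```

Because each lookup table has unique keys, the result keeps exactly one row per competitor. If you would rather fail loudly than deduplicate silently, merge also accepts validate='many_to_one', which raises when the right-hand keys are not unique.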
This merge-based approach should be much faster than using pd.Series.isin.
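Since the syslog line shows the kernel killing the process for memory, it may also help to load only the columns the matching actually needs and to check how much memory each one takes. This is a rough sketch under assumptions: the helper name load_columns and the sample CSV text are hypothetical, and the real files may have different column names.

```python
import io

import pandas as pd


def load_columns(source, columns):
    """Read only the listed columns, as strings, from a CSV path or buffer."""
    return pd.read_csv(source, usecols=columns, dtype=str)


# Stand-in for a real file: 'description' represents a wide column that
# the matching never uses.
csv_text = (
    "uuid,company_name,website,description\n"
    "u1,acme,acme.com,long text\n"
    "u2,globex,globex.com,more text\n"
)

frame = load_columns(io.StringIO(csv_text), ["uuid", "company_name", "website"])

# Per-column memory in bytes; deep=True also counts the string contents.
print(frame.memory_usage(deep=True))
```

With a real file you would pass the path instead of the StringIO buffer. Whether this is enough depends on how much of the file the unused columns account for, which the memory_usage output can help you judge.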