Improving performance when processing a pandas DataFrame (isin)

Asked: 2018-08-01 05:35:00

Tags: python performance pandas

When processing large files, my Python script exits with an OOM (out-of-memory) error. (It works for a small set of ~10K records.)

I am working with 2 files:

  • companies.csv (file size 19MB), ~43K records
  • competitor_companies.csv (file size 427MB), ~4.5 million records

In file 1, I have a field named uuid.

I need to compare:

  1. If company_name is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor_companies dataframe.

  2. If the website is the same in file 1 and file 2, copy the uuid field from file 1 into the competitor_companies dataframe.

When I process the files on a server (about 30 GB of RAM), the script gets stuck on this line:

logging.info('Matching TLD.')
match_tld = competitor_companies.tld.isin(companies.tld)

Then the script stops, and I see the following line in /var/log/syslog:

Out of memory: Kill process 177106 (company_generat) score 923 or sacrifice child
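For context on the memory use: `Series.isin` builds an intermediate hash table, and large object-dtype string columns are expensive to hold in the first place. A minimal sketch (with hypothetical toy frames standing in for the real CSVs) of shrinking a repetitive string column via the `category` dtype before calling `isin`:

```python
import pandas as pd

# Toy stand-ins for the real CSV data (values are hypothetical).
companies = pd.DataFrame({'tld': ['a.com', 'b.com']})
competitor_companies = pd.DataFrame({'tld': ['a.com', 'c.com', 'a.com']})

# 'category' stores each distinct string once plus small integer codes,
# which can shrink a multi-million-row column with repeated values.
competitor_companies['tld'] = competitor_companies['tld'].astype('category')

match_tld = competitor_companies['tld'].isin(companies['tld'])
print(match_tld.tolist())  # [True, False, True]
```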

Python code:

def MatchCompanies(
    companies: pandas.DataFrame,
    competitor_companies: pandas.DataFrame) -> pandas.DataFrame:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    The competitor_companies dataframe with a uuid column filled in for matches.

  Raises:
    ValueError: No companies found.
  """

  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields.
  companies = companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  competitor_companies = competitor_companies.fillna('')
  # Create a column to define if we found a match or not.
  competitor_companies['match'] = False
  # Add Top Level Domain (tld) column to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Create a new column since AppAnnie already contains TLD in company_url.
  competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)
  logging.info('Matching TLD.')
  match_tld = competitor_companies.tld.isin(companies.tld)
  logging.info('Matching Company Name.')
  match_company_name = competitor_companies.company_name.isin(
      companies.company_name)
  # Updates match column if TLD or company_name or similar companies matches.
  competitor_companies['match'] = match_tld | match_company_name
  # Extracts UUID for TLD matches.
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extracts UUID for company name matches.
  merge_company_name = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = merge_tld['uuid'].combine_first(
      merge_company_name['uuid'])
  match_companies = len(competitor_companies[competitor_companies['match']])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.drop('match', axis=1, inplace=True)
  competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
  return competitor_companies
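One caveat about the function above: a left merge emits one row per matching right-hand row, so if companies contains duplicate tld (or company_name) values, merge_tld can end up longer than competitor_companies and the later combine_first would misalign rows. A small sketch (toy data, assumed column names) showing the effect and a drop_duplicates guard:

```python
import pandas as pd

# Toy frames with assumed column names; 'companies' has a duplicate key.
companies = pd.DataFrame({'tld': ['a.com', 'a.com'], 'uuid': ['u1', 'u2']})
competitors = pd.DataFrame({'tld': ['a.com']})

# A left merge emits one row per matching right-hand row, so the
# result grows when the key is duplicated on the right.
print(len(competitors.merge(companies, on='tld', how='left')))  # 2

# Deduplicating the lookup frame keeps the row count stable.
deduped = companies.drop_duplicates('tld')
print(len(competitors.merge(deduped, on='tld', how='left')))  # 1
```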

This is how I read the files:

def LoadDataSet(filename: str) -> pandas.DataFrame:
  """Reads CSV file where company information is stored.

  Header information exists in CSV file.

  Args:
    filename: Source CSV file. Header is present in file.

  Returns:
    A pandas dataframe with company information.

  Raises:
     FileError: Unable to read filename.
  """
  with open(filename) as input_file:
    data = input_file.read()
    dataframe = pandas.read_csv(
        io.StringIO(data), header=0, low_memory=False)
    return dataframe.where((pandas.notnull(dataframe)), None)
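As an aside, `input_file.read()` holds the entire 427MB file in a Python string before pandas ever parses it, roughly doubling peak memory during loading. A sketch of letting `read_csv` stream the file path itself and load only the columns that are needed (the column list here is illustrative):

```python
import io
import pandas as pd

def load_dataset(filename: str, columns: list) -> pd.DataFrame:
    # Passing the path lets pandas stream the file instead of first
    # holding a full copy of its contents in a Python string.
    return pd.read_csv(filename, header=0, usecols=columns)

# Demo with an in-memory buffer standing in for a file on disk.
csv_text = "uuid,company_name,website\nu1,Alpha,a.com\n"
frame = pd.read_csv(io.StringIO(csv_text), usecols=['uuid', 'website'])
print(frame.columns.tolist())  # ['uuid', 'website']
```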

Any suggestions on how to improve my code?

top command output while the script runs:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                             
190875 myuser   20   0 4000944   2.5g 107532 R 100.7   8.5   5:01.93 company_generat   

1 Answer:

Answer 0 (score: 0)

Why don't you just use pd.merge?

You can create two lookup dataframes, one for matching on company_name and a second for matching on website, then left-merge competitor_companies on each of them:

# Create 2 matching tables
c_website = companies[['uuid', 'website']].rename(columns={'uuid': 'uuid_from_website'})
c_name = companies[['uuid', 'company_name']].rename(columns={'uuid': 'uuid_from_name'})

# Merge on each of these tables
result = (competitor_companies
          .merge(c_website, how='left', on='website')
          .merge(c_name, how='left', on='company_name'))

Then you need to reconcile the two values, for instance by giving priority to uuid_from_name:

result['uuid'] = np.where(result.uuid_from_name.notnull(),
                          result.uuid_from_name, result.uuid_from_website)
del result['uuid_from_name']
del result['uuid_from_website']

It should be much faster than using pd.Series.isin.
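A self-contained sketch of this answer on toy data (column names assumed from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frames standing in for the two CSV files.
companies = pd.DataFrame({
    'uuid': ['u1', 'u2'],
    'website': ['a.com', 'b.com'],
    'company_name': ['Alpha', 'Beta'],
})
competitor_companies = pd.DataFrame({
    'website': ['a.com', 'x.com'],
    'company_name': ['Gamma', 'Beta'],
})

# Two lookup tables, one per matching rule.
c_website = companies[['uuid', 'website']].rename(columns={'uuid': 'uuid_from_website'})
c_name = companies[['uuid', 'company_name']].rename(columns={'uuid': 'uuid_from_name'})

result = (competitor_companies
          .merge(c_website, how='left', on='website')
          .merge(c_name, how='left', on='company_name'))

# Prefer the name match, fall back to the website match.
result['uuid'] = np.where(result.uuid_from_name.notnull(),
                          result.uuid_from_name, result.uuid_from_website)
print(result['uuid'].tolist())  # ['u1', 'u2']
```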