Matching one-to-many columns in a pandas DataFrame

Asked: 2018-08-16 06:53:10

Tags: python pandas dataframe

I have two datasets in CSV files, and I use pandas to load each file into its own dataframe.

I want to find similar companies based on their URL. I am able to match companies on a single field (Rule 1), but I would like to compare them more thoroughly, as shown below:

Dataset 1

uuid, company_name, website
YAHOO,Yahoo,yahoo.com    
CSCO,Cisco,cisco.com
APPL,Apple,

Dataset 2

company_name, company_website, support_website, privacy_website
Yahoo,,yahoo.com,yahoo.com
Google,google.com,,
Cisco,,,cisco.com

Result dataset

company_name, company_website, support_website, privacy_website, uuid
Yahoo,,yahoo.com,yahoo.com,YAHOO
Google,google.com,,,
Cisco,,,cisco.com,CSCO
  • Dataset 1 contains around 50k records.
  • Dataset 2 contains around 4M records.
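For reference, the small samples above can be rebuilt in memory as follows. This is only an illustrative sketch; df1 and df2 are hypothetical variable names (in practice the frames come from pandas.read_csv on the two CSV files), and the sketches further down assume these names.

import numpy as np
import pandas as pd

# Dataset 1: ~50k records in practice; three sample rows shown here.
df1 = pd.DataFrame({
    'uuid': ['YAHOO', 'CSCO', 'APPL'],
    'company_name': ['Yahoo', 'Cisco', 'Apple'],
    'website': ['yahoo.com', 'cisco.com', np.nan],
})

# Dataset 2: ~4M records in practice; three sample rows shown here.
df2 = pd.DataFrame({
    'company_name': ['Yahoo', 'Google', 'Cisco'],
    'company_website': [np.nan, 'google.com', np.nan],
    'support_website': ['yahoo.com', np.nan, np.nan],
    'privacy_website': ['yahoo.com', np.nan, 'cisco.com'],
})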

Rules

  1. If the field website in dataset 1 matches the field company_website in dataset 2, extract the identifier.

  2. If there is no match, check whether the field website in dataset 1 matches the field support_website in dataset 2; if so, extract the identifier.

  3. If there is no match, check whether the field website in dataset 1 matches the field privacy_website in dataset 2; if so, extract the identifier.

  4. If there is no match, check whether the field company_name in dataset 1 matches the field company_name in dataset 2; if so, extract the identifier.

  5. If there is still no match, return the record with the identifier field (uuid) left empty (see the sketch after this list).
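The fallback order in these rules can be expressed compactly with pandas lookups. The following is only a sketch under the assumption that df1 holds dataset 1 and df2 holds dataset 2 (hypothetical names from the reconstruction above), not the questioner's actual code:

# Lookup Series from dataset 1; drop rows without a website so that
# missing URLs in dataset 2 cannot accidentally match them.
site_uuid = df1.dropna(subset=['website']).set_index('website')['uuid']
name_uuid = df1.dropna(subset=['company_name']).set_index('company_name')['uuid']

# Rules 1-4: try each field in priority order; rule 5: anything still
# unmatched keeps NaN in the uuid column.
df2['uuid'] = (
    df2['company_website'].map(site_uuid)
    .combine_first(df2['support_website'].map(site_uuid))
    .combine_first(df2['privacy_website'].map(site_uuid))
    .combine_first(df2['company_name'].map(name_uuid)))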

Here is my current function:

import logging

import pandas


def MatchCompanies(
    companies: pandas.DataFrame,
    competitor_companies: pandas.DataFrame) -> pandas.DataFrame:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    A sequence of matched companies and their UUID.

  Raises:
    ValueError: No companies found.
  """

  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields. Use extra space to avoid matching on empty TLD.
  companies.fillna({'website': ' '}, inplace=True)
  competitor_companies = competitor_companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  # Rename column to TLD to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Rename column to TLD since Competitor already contains TLD in company_website.
  competitor_companies.rename(columns={'company_website': 'tld'}, inplace=True)
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extracts UUID for company name matches.
  competitor_companies = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
      merge_tld['uuid'])
  match_companies = len(
      competitor_companies[competitor_companies['uuid'].notnull()])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.rename(columns={'tld': 'company_website'}, inplace=True)
  return competitor_companies

Looking for advice on which function to use?

2 answers:

Answer 0 (score: 2)

Use map with a Series, but there is one requirement - the values in df1['website'] and df1['company_name'] must always be unique:

df1 = df1.dropna()
s1 = df1.set_index('website')['uuid']
s2 = df1.set_index('company_name')['uuid']

w1 = df2['company_website'].map(s1)
w2 = df2['support_website'].map(s1)
w3 = df2['privacy_website'].map(s1)
c = df2['company_name'].map(s2)

df2['uuid'] = w1.combine_first(w2).combine_first(w3).combine_first(c)
print (df2)
  company_name company_website support_website privacy_website   uuid
0        Yahoo             NaN       yahoo.com       yahoo.com  YAHOO
1       Google      google.com             NaN             NaN    NaN
2        Cisco             NaN             NaN       cisco.com   CSCO
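If dataset 1 cannot be guaranteed to have unique websites and company names, map with a duplicated lookup index will typically raise an error. One way to satisfy the requirement (a hedged suggestion, not part of the answer above) is to deduplicate before building the lookup Series:

df1 = df1.dropna()
# keep='first' is an arbitrary choice; pick whichever record should win.
s1 = df1.drop_duplicates(subset='website', keep='first').set_index('website')['uuid']
s2 = df1.drop_duplicates(subset='company_name', keep='first').set_index('company_name')['uuid']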

Answer 1 (score: -1)

Take a look at dataframe.merge. Rename the third column in A to company_website and then do something like

A.merge(B, on='company_website', indicator=True)

That should at least take care of the first rule.
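A minimal sketch of that first step, assuming A is dataset 1 (df1 above) and B is dataset 2 (df2 above), and using a left join so unmatched rows are kept; the indicator column only shows which rows matched:

A = df1.rename(columns={'website': 'company_website'})
merged = A.merge(df2, on='company_website', how='left', indicator=True)
# merged['_merge'] is 'both' for rows matched on company_website and
# 'left_only' for rows that still need rules 2-4. Since company_name exists
# in both frames, it comes back as company_name_x / company_name_y.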