I have 2 files:
File 1:
website,company_name,uuid
yahoo.com,Yahoo,1111
facebook.com,Facebook,2222
cisco.com,Cisco Inc,3333
,Oasis,4444
google.com,Google Inc,5555
File 2:
company_url,company_name
facebook.com,
google.com,Google
,Netflix
,Pandas
My script needs to match on website or company name and generate an output like this:
company_url,company_name,uuid
facebook.com,,2222
google.com,google,5555
,netflix,
,pandas,
Currently it is doing the following:
company_url,company_name,uuid
facebook.com,,2222.0
google.com,google,5555.0
,netflix,4444
,pandas,4444
It seems to be matching on the blank website entry in File 1.
Code:
merge_tld = competitor_companies.merge(
    companies[['tld', 'uuid']], on='tld', how='left')
# Extracts UUID for company name matches.
competitor_companies = competitor_companies.merge(
    companies[['company_name', 'uuid']], on='company_name', how='left')
# Combines dataframes.
competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
    merge_tld['uuid'])
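For what it is worth, the 4444 rows can be reproduced in isolation: after fillna(''), the two competitor rows with no company_url and the File 1 row with no website all carry the same key value '', and pandas joins equal empty strings just like any other matching key. A minimal sketch with standalone toy data (not the poster's loading code):

import pandas

left = pandas.DataFrame({'tld': ['facebook.com', '', ''],
                         'company_name': ['', 'netflix', 'pandas']})
right = pandas.DataFrame({'tld': ['facebook.com', ''],
                          'uuid': [2222, 4444]})
# Both '' keys on the left pair with the '' key on the right (the Oasis row),
# which is exactly how Netflix and Pandas end up with uuid 4444.
print(left.merge(right, on='tld', how='left'))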
The function I am using:
def MatchCompetitorCompanies(
    companies: pandas.DataFrame,
    competitor_companies: pandas.DataFrame) -> Optional[Sequence[str]]:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    A sequence of matched companies and their UUID.

  Raises:
    ValueError: No companies found.
  """
  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields.
  companies = companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  competitor_companies = competitor_companies.fillna('')
  # Rename column to TLD to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Rename column to TLD since Competitor already contains TLD in company_url.
  competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extracts UUID for company name matches.
  competitor_companies = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
      merge_tld['uuid'])
  match_companies = len(
      competitor_companies[competitor_companies['uuid'].notnull()])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
  return competitor_companies
But I would like to see if there is a better solution.
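One possible adjustment, sketched here on the assumption that an empty string should simply never count as a match: run each merge only over the rows whose key is non-empty, then re-align the result, so the blank tld rows can no longer pick up the Oasis uuid. The helper below is illustrative only; it is not the poster's code, and it assumes the keys in the companies file are unique:

import pandas

def _MergeOnNonEmptyKey(competitors: pandas.DataFrame,
                        companies: pandas.DataFrame,
                        key: str) -> pandas.Series:
  # Keep only rows with a non-empty key on both sides before merging.
  non_empty = competitors[competitors[key] != '']
  merged = non_empty.merge(
      companies.loc[companies[key] != '', [key, 'uuid']], on=key, how='left')
  # Assumes unique keys in `companies`, so `merged` has one row per `non_empty`
  # row; restore the original index and reindex to the full frame so the
  # result lines up with `competitors` for combine_first.
  return pandas.Series(
      merged['uuid'].values, index=non_empty.index).reindex(competitors.index)

# Hypothetical usage inside the function above, after the renames to 'tld':
uuid_by_name = _MergeOnNonEmptyKey(competitor_companies, companies, 'company_name')
uuid_by_tld = _MergeOnNonEmptyKey(competitor_companies, companies, 'tld')
competitor_companies['uuid'] = uuid_by_name.combine_first(uuid_by_tld)

Reading (or casting) the uuid column as the nullable 'Int64' dtype would also keep values like 2222 from being displayed as 2222.0 once unmatched rows introduce missing values.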
Answer (score 1):
On the one hand, merge df1 (File 1) and df2 (File 2) on the keys 'website' (in df1) / 'company_url' (in df2); the purpose is to extract the 'company_url' (complete) and 'uuid' (partial) information:
df3 = (df1
       .merge(df2, left_on='website', right_on='company_url', how='right')
       .drop(['website', 'company_name_x'], axis=1)
      )
df3 = (df3
       .loc[~(df3.company_url.isnull())]
       .reset_index(drop=True)
       .rename(columns={'company_name_y': 'company_name'})
      )
df3:
   uuid   company_url company_name
0  2222  facebook.com          NaN
1  5555    google.com       Google
On the other hand, merge df1 and df2 on the key 'company_name'; the purpose is to extract additional 'uuid' (partial) information, which turns out not to be needed in this toy example:
df4 = (df1
       .merge(df2, on='company_name', how='right')
       .drop(['website', 'company_url'], axis=1)
      )
df4 = (df4
       .loc[~df4.company_name.isnull()]
       .reset_index(drop=True)
      )
df4:
  company_name  uuid
0       Google   NaN
1      Netflix   NaN
2       Pandas   NaN
Finally, merge df3 and df4 where 'company_name' is not null, then append the rows of df3 where 'company_name' is null back onto the merged dataframe:
df5 = (df3
       .loc[~df3.company_name.isnull()]
       .merge(df4, on='company_name', how='right')
      )
df5['uuid'] = df5['uuid_x'].fillna(df5['uuid_y'])
df5 = (df5
       .drop(['uuid_x', 'uuid_y'], axis=1)
       .append(df3.loc[df3.company_name.isnull()], ignore_index=True, sort=False)
       .fillna("")
      )
df5['company_name'] = df5.company_name.str.lower()
df5:
  company_name   company_url  uuid
0       google    google.com  5555
1      netflix
2       pandas
3               facebook.com  2222
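A small portability note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the append step above can be written with concat instead. A sketch of the equivalent call, assuming pandas is imported as pd:

df5 = (pd.concat([df5.drop(['uuid_x', 'uuid_y'], axis=1),
                  df3.loc[df3.company_name.isnull()]],
                 ignore_index=True, sort=False)
       .fillna(""))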
Hope this helps.