I want to count the companies in a company_names_list_json file. Besides plain duplicates, it also contains similar entries such as:
FooTechDepartment
FooFinaceDepartment
BarCompany
BarCompanySalesDepartment
I first remove the exact duplicates with set():
import json

with open(json_file_name) as f_in:
    companies_raw_data = json.load(f_in)
# remove duplicate companies
companies = set(companies_raw_data)
companies = sorted(companies)
The companies:
In [212]: len(companies)
Out[212]: 472
In [227]: companies[40:50]
Out[227]:
['SpeedyCloud研发中心',
'SpeedyCloud研发部',
'The ONE',
'The ONE产品研发',
'The ONE产品研发部',
'TransferEasy',
'VIPKID',
'Weego Travel技术',
'ZingFront智线',
'ZingFront智线技术部']
My idea is to process them by name length, from 1 up to max_length.
0. Set up a counter to collect the companies:
counter = {}
1. Find the one_letter_companies and remove them:
In[228]: one_letter_companies = [c for c in companies if len(c) == 1]
In[229]: len(one_letter_companies)
Out[229]: 0
2. Find the two_letters_companies, and remove them after collecting them:
In[230]: two_letters_companies = [c for c in companies if len(c) == 2]
In[231]: len(two_letters_companies)
Out[231]: 16
Add them to the counter:
In[238]: for company in two_letters_companies:
    ...:     value = [c for c in companies if c.startswith(company)]
    ...:     counter[company] = value
    ...:     for v in value:
    ...:         companies.remove(v)
It shows:
In[239]: counter
Out[239]:
{'互拍': ['互拍'],
'博飞': ['博飞'],
'城宿': ['城宿'],
'小米': ['小米', '小米小米安全', '小米小米电视'],
'币信': ['币信', '币信开发部'],
'库神': ['库神', '库神技术部'],
'微创': ['微创', '微创ITO', '微创ITO事业部', '微创微创赴微软', '微创赴微软小冰'],
'掌控': ['掌控', '掌控移动研发'],
'汇游': ['汇游'],
'百度': ['百度', '百度百度度秘事业部', '百度视频'],
'知乎': ['知乎', '知乎商业广告事业部', '知乎工程效率组', '知乎知识市场', '知乎社区平台部'],
'知藏': ['知藏'],
'纽曼': ['纽曼'],
'维朗': ['维朗'],
'艺恩': ['艺恩'],
'贝壳': ['贝壳']}
The complete code:
counter = {}
while companies:
    # separate out the one-letter companies
    one_letter_companies = [c for c in companies if len(c) == 1]
    if one_letter_companies:
        counter["one_letter_companies"] = one_letter_companies
        for c in one_letter_companies:
            companies.remove(c)
    # handle the companies whose names have more than one letter
    # find the maximum length
    max_len = max([len(c) for c in companies]) + 1
    for i in range(2, max_len):
        n_letters_companies = [c for c in companies if len(c) == i]
        if n_letters_companies:
            for company in n_letters_companies:
                value = [c for c in companies if c.startswith(company)]
                counter[company] = value
                # delete the found companies from the companies list
                for v in value:
                    companies.remove(v)
Output:
In [259]: len(counter)
Out[259]: 391 #vs 472 in the set()
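I am learning algorithms and would also like to dig deeper into Python.
Could you give me any hints on a suitable algorithm or Python library for solving this problem?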
Answer 0 (score: 1)
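This is how I would approach it. The key part is sorting the list by length, so that a department always comes after its company and the company is already in the result dictionary before any of its departments is encountered. Then we go through the names and check whether a company already exists for the current name: if it does, the name is a department and is appended to that entry; if not, the name is a company and is added as a new key.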
import json

with open(json_file_name) as f_in:
    companies_raw_data = json.load(f_in)

companies = sorted(set(companies_raw_data), key=len)
results = {}
for company in companies:
    for key in results:
        if company.startswith(key):  # is a department
            results[key].append(company)
            break
    else:  # no break -- is not a department
        results[company] = []
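As a quick check, here is a minimal sketch that wraps the same loop in a function and runs it on a few names taken from the question. The function name `group_by_prefix` and the `sample` list are assumptions made for illustration; they are not part of the original data file.

```python
def group_by_prefix(names):
    """Group each name under the existing shorter name that prefixes it."""
    results = {}
    # Sorting by length guarantees a company is seen before its departments.
    for name in sorted(set(names), key=len):
        for key in results:
            if name.startswith(key):  # name extends an existing key: a department
                results[key].append(name)
                break
        else:  # no break: no existing key is a prefix, so this is a new company
            results[name] = []
    return results


# Hypothetical sample, built from names that appear in the question.
sample = [
    "BarCompany",
    "BarCompanySalesDepartment",
    "BarCompany",  # plain duplicate, removed by set()
    "The ONE",
    "The ONE产品研发",
    "The ONE产品研发部",
    "VIPKID",
]
grouped = group_by_prefix(sample)
print(grouped)
print(len(grouped))  # number of distinct companies found
```

Unlike the `counter` in the question, the value lists here hold only the departments, not the company itself. Also note that names such as FooTechDepartment and FooFinaceDepartment would remain separate keys, because no shorter name `Foo` exists in the list to act as their company.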
The following loop may be more efficient, though it is less obvious:
results = {}
for company in companies:
    for i in range(len(company) - 1, 0, -1):
        key = company[:i]  # proper prefix of the name
        if key in results:
            results[key].append(company)
            break
    else:  # no break -- is not a department
        results[company] = []
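To make the trade-off concrete, here is a hedged sketch of this prefix-lookup variant as a function. Instead of scanning every key found so far, it slices the name and tests dictionary membership, so the work per name depends on the length of the name rather than on the number of companies already collected. The name `group_by_prefix_lookup` and the `sample` list are illustrative assumptions.

```python
def group_by_prefix_lookup(names):
    """Group names by looking up each proper prefix in the results dict."""
    results = {}
    for name in sorted(set(names), key=len):
        # Try proper prefixes from longest to shortest.
        for i in range(len(name) - 1, 0, -1):
            key = name[:i]
            if key in results:  # dict membership test, O(1) on average
                results[key].append(name)
                break
        else:  # no prefix is a known company, so this name becomes a key
            results[name] = []
    return results


# Hypothetical sample, built from names that appear in the question's output.
sample = ["知乎", "知乎知识市场", "知乎社区平台部", "小米", "小米小米电视", "贝壳"]
grouped = group_by_prefix_lookup(sample)
for company, departments in grouped.items():
    print(company, len(departments), departments)
```

Because the names are sorted by length, no key can be a prefix of another key, so at most one key matches any given name and the two loops should produce the same grouping.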