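Suppose I have a list of JSON objects like the one in the example below. Where several entries have a duplicate title attribute (determined by scoring the Levenshtein distance against some threshold), I want to filter out the duplicates that do not have the minimum value of another attribute (sourceRank).
Here is my idea for how to do this, but the indexing is broken. What is the most efficient way to do it?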
articles = [
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
    {'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
    {'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}},
]
print(len(articles))
print([a['_source']['title'] for a in articles])
def levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
indices = []
for i1, a1 in enumerate(articles):
    for i2, a2 in enumerate(articles):
        if levenshtein_distance(a1['_source']['title'], a2['_source']['title']) > .8:
            if a1['_source']['sourceRank'] > a2['_source']['sourceRank']:
                indices += [i1]
            else:
                indices += [i2]
articles = [i for j, i in enumerate(articles) if j not in indices]
print(len(articles))
print([a['_source']['title'] for a in articles])
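Answer 0 (score: 0)
The gist of your question seems to be removing duplicate titles from the list while making sure that the title that remains has the lowest sourceRank. I don't know how high sourceRank values can go, so I just used 100000 as a sentinel value.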
#!/usr/bin/env python3
import itertools
articles = [
{'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
{'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
{'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
{'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
{'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}}
]
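def reducer(iter_):
    # Return the entry with the lowest sourceRank in the group.
    max_rank = 100000  # sentinel: assumed to be above any real sourceRank
    retval = None
    for value in iter_:
        current_rank = value["_source"]["sourceRank"]
        if current_rank < max_rank:
            max_rank = current_rank
            retval = value
    return retval

# Note: groupby only merges *adjacent* entries with an equal title, so sort
# the list by title first if duplicates are not already next to each other.
for title, _source in itertools.groupby(articles, lambda x: x["_source"].get("title")):
    print(reducer(_source))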
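The code above only merges titles that match exactly. If near-duplicates (titles within a Levenshtein threshold) should be merged as well, as the question describes, something along the following lines might work. This is only a sketch under some assumptions: the helper names `similarity` and `dedupe` are made up for illustration, the distance is normalised by the longer title's length so that a threshold of 0.8 means "at least 80% similar", and each title is compared greedily against the current representative of every cluster found so far, which is simple but not the only way to cluster.

```python
def levenshtein_distance(s1, s2):
    """Dynamic-programming edit distance (same routine as in the question)."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        new_distances = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                new_distances.append(distances[i1])
            else:
                new_distances.append(
                    1 + min(distances[i1], distances[i1 + 1], new_distances[-1])
                )
        distances = new_distances
    return distances[-1]


def similarity(s1, s2):
    """Hypothetical helper: map the distance to [0, 1], 1.0 = identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / longest


def dedupe(articles, threshold=0.8):
    """Hypothetical helper: keep the lowest-sourceRank article per cluster."""
    kept = []  # one representative per cluster, in first-seen order
    for article in articles:
        title = article['_source']['title']
        for pos, rep in enumerate(kept):
            if similarity(title, rep['_source']['title']) >= threshold:
                # Same cluster: keep whichever has the lower sourceRank.
                if article['_source']['sourceRank'] < rep['_source']['sourceRank']:
                    kept[pos] = article
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            kept.append(article)
    return kept


articles = [
    {'_source': {'title': 'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank': 4.0}},
    {'_source': {'title': 'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank': 1.0}},
    {'_source': {'title': 'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank': 2.0}},
    {'_source': {'title': 'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank': 1.0}},
    {'_source': {'title': 'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank': 3.0}},
]

result = dedupe(articles)
for article in result:
    print(article['_source']['sourceRank'], article['_source']['title'])
```

Because the representative is picked by comparing ranks directly, no sentinel value is needed. The cost is still one distance computation per article per cluster, so for large lists a dedicated Levenshtein library or a cheaper pre-filter (for example on title length) would probably be worth considering.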