(I am trying to update a hunspell spelling dictionary.) My synonyms file looks like this:
mylist="""
specimen|3
sample
prototype
example
sample|3
prototype
example
specimen
prototype|3
example
specimen
sample
example|3
specimen
sample
prototype
protoype|1
illustration
"""
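The first step is to merge duplicate words. In the example above, the word "prototype" appears twice as a headword, so I need to merge the two entries. Its count changes from 3 to 4 because the synonym "illustration" is added.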
specimen|3
sample
prototype
example
sample|3
prototype
example
specimen
prototype|4
example
specimen
sample
illustration
example|3
specimen
sample
prototype
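The second step is more complicated. Merging duplicates is not enough: the added word should also be propagated to the linked words. In this case, I need to search for "prototype" in the synonym lists and, wherever it is found, add the word "illustration". The final word list will look like this: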
specimen|4
sample
prototype
example
illustration
sample|4
prototype
example
specimen
illustration
prototype|4
example
specimen
sample
illustration
example|4
specimen
sample
prototype
illustration
A new entry for "illustration" should also be added to the original list, with all 4 linked words as its synonyms.
illustration|4
example
specimen
sample
prototype
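What I have tried:
from io import StringIO

myfile = StringIO(mylist)
for lineno, i in enumerate(myfile):
    if i:
        try:
            # headword lines look like "word|count"
            if int(i.split("|")[1]) > 0:
                print(lineno, i.split("|")[0], int(i.split("|")[1]))
        except (IndexError, ValueError):
            # synonym lines have no "|count" part; skip them
            pass
The code above prints each headword with its line number and count.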
1 specimen 3
5 sample 3
9 prototype 3
13 example 3
17 protoype 1
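This means I need to merge the 1 word on line 18 into the entry on line 9 ("prototype"), as its 4th synonym. If I can do that, I will have completed step 1 of the task.
Answer 0 (score: 3)
Use a graph: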
mylist="""
specimen|3
sample
prototype
example
sample|3
prototype
example
specimen
prototype|3
example
specimen
sample
example|3
specimen
sample
prototype
prototype|1
illustration
specimen|1
cat
happy|2
glad
cheerful
"""
import networkx as nx

G = nx.Graph()
nodes = []  # every word, in order of first appearance
for line in mylist.strip().splitlines():
    if '|' in line:
        # headword line "word|count": start a new entry
        node, _ = line.split('|')
        if node not in nodes:
            nodes.append(node)
        G.add_node(node)
    else:
        # synonym line: link it to the current headword
        G.add_edge(node, line)
        if line not in nodes:
            nodes.append(line)

for node in nodes:
    neighbors = list(G.neighbors(node))
    # words not directly linked to node but reachable through other words
    non_neighbors = []
    for non_nb in nx.non_neighbors(G, node):
        try:
            if nx.bidirectional_dijkstra(G, node, non_nb):
                non_neighbors.append(non_nb)
        except nx.NetworkXNoPath:
            pass
    syns = neighbors + non_neighbors
    print('{}|{}'.format(node, len(syns)))
    print('\n'.join(syns))
Output (the order of the synonyms within an entry may vary):
specimen|5
sample
prototype
example
cat
illustration
sample|5
specimen
prototype
example
illustration
cat
prototype|5
sample
specimen
example
illustration
cat
example|5
sample
specimen
prototype
illustration
cat
illustration|5
prototype
specimen
cat
sample
example
cat|5
specimen
illustration
sample
prototype
example
happy|2
cheerful
glad
glad|2
happy
cheerful
cheerful|2
happy
glad
The graph looks like this: [image]
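Answer 1 (score: 1)
The problem you describe is a classic Union-Find problem, which can be solved with a disjoint-set algorithm. Don't reinvent the wheel.
Read up on Union-Find / disjoint sets: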
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
or see this question:
Union find implementation using Python
class DisjointSet(object):
    def __init__(self):
        self.leader = {}  # maps a member to the group's leader
        self.group = {}  # maps a group leader to the group (which is a set)

    def add(self, a, b):
        leadera = self.leader.get(a)
        leaderb = self.leader.get(b)
        if leadera is not None:
            if leaderb is not None:
                if leadera == leaderb: return  # nothing to do
                groupa = self.group[leadera]
                groupb = self.group[leaderb]
                if len(groupa) < len(groupb):
                    a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
                groupa |= groupb
                del self.group[leaderb]
                for k in groupb:
                    self.leader[k] = leadera
            else:
                self.group[leadera].add(b)
                self.leader[b] = leadera
        else:
            if leaderb is not None:
                self.group[leaderb].add(a)
                self.leader[a] = leaderb
            else:
                self.leader[a] = self.leader[b] = a
                self.group[a] = set([a, b])
mylist="""
specimen|3
sample
prototype
example
sample|3
prototype
example
specimen
prototype|3
example
specimen
sample
example|3
specimen
sample
prototype
prototype|1
illustration
specimen|1
cat
happy|2
glad
cheerful
"""
ds = DisjointSet()
for line in mylist.strip().splitlines():
    if '|' in line:
        node, _ = line.split('|')
    else:
        ds.add(node, line)

for _, g in ds.group.items():
    print(sorted(g))  # sorted only to make the printed order stable
>>>
['cat', 'example', 'illustration', 'prototype', 'sample', 'specimen']
['cheerful', 'glad', 'happy']
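To get from these groups back to the layout in the question, each word can be written as a headword with the other members of its group listed beneath it. A minimal sketch, assuming the groups are available as plain Python sets (for example from `ds.group.values()`); the function name `format_groups` and the alphabetical ordering are my own choices for illustration and are not required by the file format:

```python
def format_groups(groups):
    """Render each word as 'word|count' followed by its synonyms, one per line."""
    lines = []
    for group in groups:
        for word in sorted(group):
            # every other member of the group is a synonym of this word
            synonyms = sorted(group - {word})
            lines.append('{}|{}'.format(word, len(synonyms)))
            lines.extend(synonyms)
    return lines


groups = [
    {'specimen', 'illustration', 'cat', 'sample', 'prototype', 'example'},
    {'cheerful', 'glad', 'happy'},
]
print('\n'.join(format_groups(groups)))
```

Each word in a group of n words gets n - 1 synonyms, which matches the counts in the first answer's output (5 for the large group, 2 for the small one).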
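This problem can be solved with Dijkstra's algorithm, but I think that is overkill: you don't actually need the shortest distance between nodes, you only need the connected components of the graph.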
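For comparison, finding connected components needs nothing more than a breadth-first search over the synonym links. The following is a rough sketch in plain Python rather than networkx; `parse_links` and `connected_components` are hypothetical helper names, the `sample` data is a shortened made-up input, and the code assumes the `word|count` headword format shown in the question:

```python
from collections import deque


def parse_links(text):
    """Build an undirected adjacency map from 'word|count' entries."""
    links = {}
    head = None
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if '|' in line:
            head = line.split('|')[0]
            links.setdefault(head, set())  # keep headwords without synonyms
        elif head is not None:
            links.setdefault(head, set()).add(line)
            links.setdefault(line, set()).add(head)
    return links


def connected_components(links):
    """Return the groups of mutually reachable words as a list of sets."""
    seen = set()
    components = []
    for start in links:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            word = queue.popleft()
            for other in links[word]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        components.append(component)
    return components


sample = """
prototype|1
illustration
specimen|2
prototype
cat
happy|1
glad
"""
for component in connected_components(parse_links(sample)):
    print(sorted(component))
```

Every word is visited once and every link is followed at most twice, so no path lengths are ever computed.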