I've tied myself in a bit of a mental knot. Given data like this:
data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
('topic4', (['trees', 'flowers'], 0.14975108213820515))]
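I want to connect topics based on whether they share at least one text in the list (the first element of the tuple's second element). So in the example above: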
topic1 is connected to topic2
topic2 is connected to topic1 and topic3
topic3 is connected to topic2
topic4 is unconnected
Ideally, my output would look like:
output = [(topic1,topic2),
(topic1,topic2, topic3),
(topic3, topic2),
(topic4)]
So, given an input like data, how can I get an output like output? I suspect itertools might be involved somehow, but I'm really stuck right now.
Answer 0 (score: 2)
An efficient approach is to use sets.
>>> set1= set(['apples', 'oranges'])
>>> set2= set(['oranges', 'raisins'])
>>> print len(set1.intersection(set2))
1
So, basically:
# Python 2 (iteritems and the print statement)
topic_text_sets = {topic: set(text) for topic, (text, _) in data}
topic_related = {}
for topic1, text1 in topic_text_sets.iteritems():
    related = [topic2 for topic2, text2 in topic_text_sets.iteritems()
               if topic1 != topic2 and len(text1.intersection(text2)) > 0]
    topic_related[topic1] = related
    print topic1, related
This prints:
topic1 ['topic2']
topic3 ['topic2']
topic2 ['topic1', 'topic3']
topic4 []
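The snippet above is Python 2. As a rough sketch of the same idea for Python 3, the following wraps it in a function; the name `related_topics` is made up for illustration, and `isdisjoint` is swapped in for the intersection-length check.

```python
def related_topics(data):
    """Map each topic to the other topics that share at least one text."""
    # Convert each topic's text list to a set once, up front.
    topic_text_sets = {topic: set(texts) for topic, (texts, _) in data}
    return {
        topic1: [
            topic2
            for topic2, texts2 in topic_text_sets.items()
            # Two topics are related when their text sets overlap.
            if topic1 != topic2 and not texts1.isdisjoint(texts2)
        ]
        for topic1, texts1 in topic_text_sets.items()
    }


data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
        ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
        ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
        ('topic4', (['trees', 'flowers'], 0.14975108213820515))]

for topic, related in related_topics(data).items():
    print(topic, related)
```

This compares every pair of topics, which is fine for a handful of topics but grows quadratically with their number.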
Answer 1 (score: 2)
Build a dictionary of sets to capture the connections:
connections = {}
for topic, (conns, some_number) in data:
    for conn in conns:
        connections.setdefault(conn, set()).add(topic)
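This maps each connection value to the set of topics that contain it.
Now you can look up the reverse connections; if order doesn't matter, just take the union of the sets for all of a topic's connection values: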
output = [tuple(set().union(*(connections[c] for c in conns)))
for topic, (conns, some_number) in data]
Demo:
>>> data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
... ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
... ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
... ('topic4', (['trees', 'flowers'], 0.14975108213820515))]
>>> connections = {}
>>> for topic, (conns, some_number) in data:
...     for conn in conns:
...         connections.setdefault(conn, set()).add(topic)
...
>>> [tuple(set().union(*(connections[c] for c in conns)))
... for topic, (conns, some_number) in data]
[('topic1', 'topic2'), ('topic1', 'topic3', 'topic2'), ('topic3', 'topic2'), ('topic4',)]
>>> from pprint import pprint
>>> pprint(_)
[('topic1', 'topic2'),
('topic1', 'topic3', 'topic2'),
('topic3', 'topic2'),
('topic4',)]
Otherwise, remove topic from the union first and move it to the front:
output = [(topic,) + tuple(set().union(*(connections[c] for c in conns)) - {topic})
          for topic, (conns, some_number) in data]
Demo:
>>> [(topic,) + tuple(set().union(*(connections[c] for c in conns)) - {topic})
...  for topic, (conns, some_number) in data]
[('topic1', 'topic2'), ('topic2', 'topic1', 'topic3'), ('topic3', 'topic2'), ('topic4',)]
>>> pprint(_)
[('topic1', 'topic2'),
 ('topic2', 'topic1', 'topic3'),
 ('topic3', 'topic2'),
 ('topic4',)]
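Because sets are unordered, the order of the other topics inside each tuple is not guaranteed. Here is a hedged sketch of the same inverted-index idea as a reusable function; the name `connected_topics` is hypothetical, and sorting the remaining topics is an added assumption to make the result deterministic.

```python
def connected_topics(data):
    """Return one tuple per topic: the topic itself, followed by the
    sorted names of every other topic sharing at least one text."""
    # Inverted index: text -> set of topics that mention it.
    connections = {}
    for topic, (texts, _) in data:
        for text in texts:
            connections.setdefault(text, set()).add(topic)

    output = []
    for topic, (texts, _) in data:
        # Union of all topics reachable through any of this topic's texts.
        linked = set().union(*(connections[text] for text in texts))
        linked.discard(topic)
        output.append((topic,) + tuple(sorted(linked)))
    return output


data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
        ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
        ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
        ('topic4', (['trees', 'flowers'], 0.14975108213820515))]

print(connected_topics(data))
```

Building the index takes one pass over the data, so this avoids comparing every topic against every other topic.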
Answer 2 (score: 1)
A simple approach with two for loops:
>>> for i in range(len(data)):
...     x = set(data[i][1][0])
...     for j in range(len(data)):
...         if len(x & set(data[j][1][0])) >= 1:
...             print data[j][0],  # Python 2; on Python 3 use print(data[j][0], end=' ')
...     print
...
topic1 topic2
topic1 topic2 topic3
topic2 topic3
topic4
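The nested loops above test every pair twice and also compare each topic with itself. Since the question mentions itertools, here is a hedged Python 3 sketch that uses `itertools.combinations` to visit each unordered pair once. It returns the linked pairs rather than the per-topic groups, and `linked_pairs` is an illustrative name, not part of the answer.

```python
from itertools import combinations


def linked_pairs(data):
    """Return the pairs of topics that share at least one text."""
    pairs = []
    # combinations() yields each unordered pair of entries exactly once.
    for (topic_a, (texts_a, _)), (topic_b, (texts_b, _)) in combinations(data, 2):
        if set(texts_a) & set(texts_b):
            pairs.append((topic_a, topic_b))
    return pairs


data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
        ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
        ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
        ('topic4', (['trees', 'flowers'], 0.14975108213820515))]

print(linked_pairs(data))
```

An unconnected topic such as topic4 simply never appears in any pair.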
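Answer 3 (score: 0)
Break it down into subproblems.
First, you need to get all the distinct texts, probably with a list comprehension (or a set comprehension to avoid duplicates). Then iterate over them and, for each text, find every entry in data that contains it. You shouldn't need itertools; it would probably only complicate things.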
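The two steps described above could be sketched roughly as follows in Python 3. The variable names (`distinct_texts`, `topics_by_text`, `shared`) are chosen here for illustration, and the final filtering step is an added assumption about what to do with the result.

```python
data = [('topic1', (['apples', 'oranges'], 0.14975108213820515)),
        ('topic2', (['oranges', 'raisins'], 0.14975108213820515)),
        ('topic3', (['grapes', 'raisins'], 0.14975108213820515)),
        ('topic4', (['trees', 'flowers'], 0.14975108213820515))]

# Step 1: collect every distinct text with a set comprehension.
distinct_texts = {text for _, (texts, _) in data for text in texts}

# Step 2: for each text, find the topics whose text list contains it.
topics_by_text = {
    text: [topic for topic, (texts, _) in data if text in texts]
    for text in sorted(distinct_texts)
}

# Texts listed under more than one topic are the ones that link topics.
shared = {text: topics for text, topics in topics_by_text.items()
          if len(topics) > 1}
print(shared)
```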