I have data like this:
sample event caller
A1 5 version1
A1 5 version2
A1 5 version3
A1 5 version4
A2 1 version1
A2 1 version3
A2 2 version1
A2 3 version1
A3 5 version4
A3 6 version1
A3 6 version2
A3 6 version4
B4 1 version1
B4 1 version2
B4 1 version3
B4 1 version4
This shows the events called in different samples by particular versions of a script (the caller).
For example, event 5 in sample A1 is called by version1, version2, version3 and version4:
A1 5 version1
A1 5 version2
A1 5 version3
A1 5 version4
Event 1 in sample B4 is also called by version1, version2, version3 and version4:
B4 1 version1
B4 1 version2
B4 1 version3
B4 1 version4
This would constitute a set with two members: sample:B4, event:1 and sample:A1, event:5.
Event 1 in sample A2 is called only by version1 and version3:
A2 1 version1
A2 1 version3
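I'm trying to count the intersections between the callers so that I can see, and eventually represent as a Venn diagram, for example how many events are called by version1 alone, or by both version1 and version4.
This is what I've done so far; I'm struggling to aggregate all the events: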
#!/usr/bin/python
from collections import defaultdict
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("in_file")
args = parser.parse_args()
calls = defaultdict(list)
# put a list of callers into dictionary keyed by sample and event:
with open(args.in_file) as f:
    for l in f:
        parts = l.rstrip().split('\t')
        (sample, event, caller) = parts[0:3]
        calls[(sample,event)].append(caller)
# For each call, extract the version support
for call in calls:
    s = set(calls[call])
    printset = ', '.join(s)
    print(printset, len(s))
This prints:
('version4, version1, version2', 3)
('version1', 1)
('version4, version1, version2, version3', 4)
('version1', 1)
('version1, version3', 2)
('version4, version1, version2, version3', 4)
('version4', 1)
From this toy example, the output I'm trying to get would be:
Set_size Callers
1 version4 + version1 + version2
2 version1
2 version1 + version2 + version3 + version4
1 version1 + version3
1 version4
Answer 0 (score: 1)
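It sounds like you want to count how many times events in a sample were called by exactly the same set of callers. Your current code is a good start, but it only gets you part of the way there. You need an additional data structure to count the occurrences of each set of callers. I suggest using a collections.Counter to tally the matching sets: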
#!/usr/bin/python
from __future__ import print_function # so print(..., sep=...) also works on Python 2
from collections import defaultdict, Counter # new import here
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("in_file")
args = parser.parse_args() # needed: args.in_file is used below
calls = defaultdict(list)
# put a list of callers into dictionary keyed by sample and event:
with open(args.in_file) as f:
    for l in f:
        parts = l.rstrip().split('\t')
        (sample, event, caller) = parts[0:3]
        calls[(sample,event)].append(caller)
counts = Counter(map(frozenset, calls.values())) # aggregate the data
# loop over and print the results; sorting the callers keeps the output stable
for callers, count in counts.items():
    print(count, " + ".join(sorted(callers)), sep='\t')
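To check the Counter step without an input file, here is a minimal sketch that runs the same aggregation on the toy data from the question. The names `DATA`, `rows` and `count_caller_sets` are illustrative only and not part of the original script, and the rows are split on any whitespace rather than on tabs.

```python
from collections import Counter, defaultdict

# Toy data from the question (header line omitted), one "sample event caller" row per line.
DATA = """\
A1 5 version1
A1 5 version2
A1 5 version3
A1 5 version4
A2 1 version1
A2 1 version3
A2 2 version1
A2 3 version1
A3 5 version4
A3 6 version1
A3 6 version2
A3 6 version4
B4 1 version1
B4 1 version2
B4 1 version3
B4 1 version4
"""
rows = [line.split() for line in DATA.splitlines()]


def count_caller_sets(rows):
    """Count how many (sample, event) pairs share exactly the same set of callers."""
    calls = defaultdict(set)
    for sample, event, caller in rows:
        calls[(sample, event)].add(caller)
    # frozenset is hashable, so each distinct caller set can be a Counter key
    return Counter(frozenset(callers) for callers in calls.values())


counts = count_caller_sets(rows)
for callers, count in counts.most_common():
    print(count, " + ".join(sorted(callers)), sep="\t")
```

On this data the set containing only version1 and the set containing all four versions each get a count of 2, and the other three sets get a count of 1, which matches the table asked for in the question.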
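I've assumed that you don't care which events the callers intersect on. If you want to count the intersections separately for different events, you'll need to include more data in the values you add to the Counter. For example, you could count 2-tuples that combine the event with a frozenset of callers: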
counts = Counter((event, frozenset(callers)) for (sample, event), callers in calls.items())
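As a rough illustration of how the per-event keys behave, the sketch below applies that expression to the question's toy data. The names `DATA`, `rows` and `per_event` are made up for this example, and the rows are split on whitespace instead of tabs.

```python
from collections import Counter, defaultdict

# Toy data from the question (header line omitted).
DATA = """\
A1 5 version1
A1 5 version2
A1 5 version3
A1 5 version4
A2 1 version1
A2 1 version3
A2 2 version1
A2 3 version1
A3 5 version4
A3 6 version1
A3 6 version2
A3 6 version4
B4 1 version1
B4 1 version2
B4 1 version3
B4 1 version4
"""
rows = [line.split() for line in DATA.splitlines()]

# Group the callers by (sample, event), as in the script above.
calls = defaultdict(list)
for sample, event, caller in rows:
    calls[(sample, event)].append(caller)

# Key each count by (event, caller set) so different events are tallied separately.
per_event = Counter(
    (event, frozenset(callers)) for (sample, event), callers in calls.items()
)

# Sort by event, then by caller names, purely to make the printout stable.
for (event, callers), count in sorted(
    per_event.items(), key=lambda kv: (kv[0][0], sorted(kv[0][1]))
):
    print(event, count, " + ".join(sorted(callers)), sep="\t")
```

With keys this fine-grained, every (event, caller set) combination in the toy data occurs exactly once. Event 5, for instance, is called by all four versions in A1 but only by version4 in A3, so it contributes two separate keys.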