Question

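If I have a file whose lines each start with a number followed by some text, how can I see whether each number is always followed by the same text or by different text? For example: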

0 Brucella abortus Brucellaceae
0 Brucella ceti Brucellaceae
0 Brucella canis Brucellaceae
0 Brucella ceti Brucellaceae

So here, I would like to know that 0 is followed by 3 different "types" of text.

Ideally, I could read the file into a Python script that outputs something like this:

1:250
2:98
3:78
4:65
etc.

The first number is the number of distinct "texts", and the number after the : is how many numbers have that many distinct texts.

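I have the following script, which counts how many different numbers each "text" is found under, so I am wondering how to reverse it so that I know how many distinct texts each number has, and how many numbers share each distinct-text count. The script reads the file of numbers and "text" into a dictionary, but I am not sure how to manipulate this dictionary to get what I want.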

#!/usr/bin/env python
# Build a dictionary mapping each cluster number to the species names listed for it

fileIn = 'usearchclusternumgenus.txt'

d = {}
with open(fileIn, "r") as f:
    for line in f:
        clu, gen, spec, fam = line.split()
        d.setdefault(clu, []).append(spec)


# Iterate through and find out in how many clusters each species occurs
vals = {}                           # A dictionary to store how often each value occurs.
for i in d.values():
    for j in set(i):                # Convert to a set to remove duplicates
        vals[j] = 1 + vals.get(j, 0)  # If we've seen this value, increment the count;
                                      # otherwise start from the default of 0
# print(vals)

# Iterate through each possible frequency and find how many values have that count.
counts = {}                         # A dictionary to store the final frequencies.
# We will iterate from 0 (which is a valid count) to the maximum count
for i in range(0, max(vals.values()) + 1):
    # Find all values that have the current frequency, count them,
    # and add them to the frequency dictionary
    counts[i] = len([x for x in vals.values() if x == i])

for key in sorted(counts.keys()):
    if counts[key] > 0:
        print("{}:{}".format(key, counts[key]))

Answer 1

Use a collections.defaultdict() object with set as the factory to track the distinct texts for each number, then print the size of each collected set:

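from collections import defaultdict

fileIn = 'usearchclusternumgenus.txt'  # same input file as in the question

# Map each cluster number to the set of distinct species seen for it
unique_clu = defaultdict(set)

with open(fileIn) as infh:
    for line in infh:
        # Split into at most 4 fields; anything after the species goes into rest
        clu, gen, spec, rest = line.split(None, 3)
        unique_clu[clu].add(spec)

for key in sorted(unique_clu):
    count = len(unique_clu[key])
    if count:
        print('{}:{}'.format(key, count))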
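
The loop above prints one line per number, giving that number's count of distinct texts. The output format asked for in the question (distinct-text count, then how many numbers have that count) needs one more tallying step. A minimal sketch of that step, assuming collections.Counter is acceptable and that the "text" to compare is the species field, as in the code above; the function name distinct_text_histogram and the sample list are only illustrative and not part of the original answer:

```python
from collections import Counter, defaultdict


def distinct_text_histogram(lines):
    """Return a Counter mapping 'number of distinct species' to
    'how many cluster numbers have that many distinct species'."""
    unique_clu = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Same split as in the answer: number, genus, species, remainder
        clu, gen, spec, rest = line.split(None, 3)
        unique_clu[clu].add(spec)
    # Tally the set sizes: one entry per distinct-species count
    return Counter(len(species) for species in unique_clu.values())


# Sample lines taken from the question (hypothetical stand-in for the real file)
sample = [
    "0 Brucella abortus Brucellaceae",
    "0 Brucella ceti Brucellaceae",
    "0 Brucella canis Brucellaceae",
    "0 Brucella ceti Brucellaceae",
]

histogram = distinct_text_histogram(sample)
for n_distinct in sorted(histogram):
    print("{}:{}".format(n_distinct, histogram[n_distinct]))
```

For the four sample lines this prints 3:1, since the single number 0 has three distinct species. To run it on the real file, pass the open file object instead of the list, for example distinct_text_histogram(open(fileIn)).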
