Question

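I have a tab-delimited file from which I need to extract all of the column 12 content (the document categories). However, the column 12 content is highly repetitive, so first I need a list with the duplicates removed, which tells me how many categories there are. Then I need a way to get the number of lines for each category. My attempt is as follows: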

def remove_duplicates(l): # define function to remove duplicates
    return list(set(l))

input = sys.argv[1] # command line arguments to open tab file
infile = open(input)
for lines in infile: # split content into lines
    words = lines.split("\t") # split lines into words i.e. columns
    dataB2.append(words[11]) # column 12 contains the desired repetitive categories
    dataB2 = dataA.sort() # sort the categories
    dataB2 = remove_duplicates(dataA) # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
    print(len(dataB2))
infile.close()

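I don't know how to get the number of lines for each category. So my questions are: how do I eliminate the duplicates efficiently, and how do I get the number of lines for each category?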

Answer 1

I suggest using Python's Counter (http://www.cnblogs.com/gaoxiang12/p/4659805.html) for this. A Counter does almost exactly what you are asking for, so your code would look like this:

from collections import Counter
import sys

count = Counter()

# Note that the with open()... syntax is generally preferred: it closes the file for you.
with open(sys.argv[1]) as infile:
    for line in infile:  # iterate over the file line by line
        words = line.rstrip("\n").split("\t")  # split the line into columns
        count.update([words[11]])  # column 12 holds the category

print(count)
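
As a minimal sketch of how the Counter result answers both parts of the question: the number of distinct categories is the length of the counter, and the per-category line counts are its values. The sample rows and the helper name count_categories are hypothetical and stand in for the real file.

```python
from collections import Counter
import io

def count_categories(lines, column=11):
    """Count how often each value appears in the given zero-based column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        counts[fields[column]] += 1
    return counts

# Hypothetical sample: 12 tab-separated columns, category in the last one.
sample = io.StringIO(
    "\t".join(["x"] * 11 + ["news"]) + "\n"
    + "\t".join(["x"] * 11 + ["blog"]) + "\n"
    + "\t".join(["x"] * 11 + ["news"]) + "\n"
)

counts = count_categories(sample)
print(len(counts))           # number of distinct categories
print(counts.most_common())  # (category, line count) pairs, most frequent first
```

Stripping the trailing newline matters when column 12 is the last column on the line, since otherwise the newline would become part of the category name.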

Answer 2

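All you need to do is read each line from the file, split it on tabs, grab column 12 from each line, and put it in a list. (If you don't care about duplicate values, just make column_12 = set() and use add(item) instead of append(item).) Then you simply use len() to get the length of the collection. Or, if you want both, you can use a list and then convert it to a set.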

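Edit: To count each category (thanks to Tom Morris for pointing out that I hadn't actually answered the question), iterate over set(column_12) so that no category is counted more than once, and use the list's built-in count() method.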

import sys

column_12 = []
with open(sys.argv[1], 'r') as fob:  # path to the tab-delimited file
    for line in fob:
        # strip the newline so it cannot end up inside the category name
        column_12.append(line.rstrip('\n').split('\t')[11])

print('Unique values in column 12: %d' % len(set(column_12)))
print('All values in column 12: %d' % len(column_12))
print('Count per category:')
for cat in set(column_12):
    print('%s - %d' % (cat, column_12.count(cat)))
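
One thing to keep in mind with the approach above: list.count() rescans the whole list for every category, so the work grows with the number of lines times the number of categories. A rough single-pass alternative, assuming the column values are already in a list, could look like the sketch below. The helper name tally_column and the sample values are hypothetical.

```python
def tally_column(values):
    """Return {value: occurrences}, built in one pass over the values."""
    tally = {}
    for value in values:
        tally[value] = tally.get(value, 0) + 1
    return tally

# Hypothetical column-12 values standing in for the parsed file.
column_12 = ["news", "blog", "news", "wiki", "news"]

tally = tally_column(column_12)
print("Unique values in column 12: %d" % len(tally))
for cat in sorted(tally):
    print("%s - %d" % (cat, tally[cat]))
```

The keys of the dictionary are the de-duplicated categories, so no separate set is needed.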

Remove duplicates from a list produced from a tab-delimited file and categorize further
