Question

I have a tab-delimited file with two columns, movie genre and year:

Comedy  2013
Comedy  2014
Drama   2012
Mystery 2011
Comedy  2013
Comedy  2013
Comedy  2014
Comedy  2013
News    2012
Sport   2012
Sci-Fi  2013
Comedy  2014
Family  2013
Comedy  2013
Drama   2013
Biography   2013

I want to group the genres by year and print the counts in the following format (not necessarily in alphabetical order):

Year        2011  2012  2013  2014
Biography      0     0     1     0
Comedy         0     0     5     3
Drama          0     1     1     0
Family         0     0     1     0
Mystery        1     0     0     0
News           0     1     0     0
Sci-Fi         0     0     1     0
Sport          0     1     0     0

How should I approach this? At the moment I create the output in MS Excel, but I would like to do it in Python.

Answer 1

The easiest way is to use the pandas library, which offers many ways of working with tabular data:

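import pandas as pd
# copy the data to the clipboard first; read_clipboard parses it into a DataFrame
df = pd.read_clipboard(names=['genre', 'year'])
df.pivot_table(index='genre', columns='year', aggfunc=len, fill_value=0)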

Output:

year       2011  2012  2013  2014
genre                            
Biography     0     0     1     0
Comedy        0     0     5     3
Drama         0     1     1     0
Family        0     0     1     0
Mystery       1     0     0     0
News          0     1     0     0
Sci-Fi        0     0     1     0
Sport         0     1     0     0

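If you are just starting out with Python, learning pandas on top of the language itself may feel like too much. Once you have some Python under your belt, though, pandas provides very intuitive ways of interacting with data.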
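When the data lives in a file rather than on the clipboard, a similar table can be built with `pd.crosstab`. This is a minimal sketch, not the method from the answer above: `crosstab` is swapped in for `pivot_table`, and the `SAMPLE` string and the `genre_year_table` helper are hypothetical names used only to keep the example self-contained.

```python
import io

import pandas as pd

# A few rows from the question, tab-separated (stand-in for the real file).
SAMPLE = "Comedy\t2013\nComedy\t2014\nDrama\t2012\nMystery\t2011\nComedy\t2013\n"


def genre_year_table(source):
    """Return a genre-by-year count table from a tab-separated source."""
    df = pd.read_csv(source, sep="\t", names=["genre", "year"])
    # crosstab counts how often each (genre, year) pair occurs;
    # pairs that never occur are filled with 0.
    return pd.crosstab(df["genre"], df["year"])


table = genre_year_table(io.StringIO(SAMPLE))
print(table)
```

To use a real file, pass its path instead of the `io.StringIO` object, for example `genre_year_table('tab.txt')`.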

Answer 2

If you would rather not use pandas, you can do it like this:

from collections import Counter

# load file, skipping blank lines (such as the one after a trailing newline)
with open('tab.txt') as f:
    lines = [l for l in f.read().splitlines() if l.strip()]

# replace separating whitespace with exactly one space
lines = [' '.join(l.split()) for l in lines]

# find all years and genres
genres = sorted(set(l.split()[0] for l in lines))
years = sorted(set(l.split()[1] for l in lines))

# count genre-year combinations
C = Counter(lines)

# print table (Python 3 print function; end=' ' keeps the row on one line)
print("Year".ljust(10), end=' ')
for y in years:
    print(y.rjust(6), end=' ')
print()
for g in genres:
    print(g.ljust(10), end=' ')
    for y in years:
        print(str(C[g + ' ' + y]).rjust(6), end=' ')
    print()

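The most interesting function here is probably Counter, which counts the number of occurrences of each element. To make sure the width of the separating whitespace does not affect the count, I replace it with a single space beforehand.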
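To illustrate how Counter behaves, here is a small sketch that keys the counts on (genre, year) tuples instead of normalized strings. This is a variation on the approach above, not the answer's own code, and the `rows` list is made-up sample data with deliberately inconsistent spacing.

```python
from collections import Counter

# A few rows in the question's format; whitespace between fields varies on purpose.
rows = ["Comedy\t2013", "Comedy  2014", "Drama   2012", "Comedy 2013"]

# split() with no arguments splits on any run of whitespace, so each row
# becomes the same (genre, year) tuple regardless of how it was spaced.
counts = Counter(tuple(row.split()) for row in rows)

print(counts[("Comedy", "2013")])  # 2
print(counts[("Sport", "2012")])   # 0, because a Counter returns 0 for missing keys
```

Because missing keys count as zero, the table-printing loop can look up every genre and year combination without checking first whether it occurred.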
