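Unable to group and total a CSV file

Date: 2017-05-01 14:39:57

Tags: python python-2.7

I have created a CSV file with two columns, author and number of books. See the sample below (apologies that it does not look like a table; column 1 holds the author and column 2 holds only the number 1 in this sample).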

Vincent 1
Vincent 1
Vincent 1
Vincent 1
Thomas  1
Thomas  1
Thomas  1
Jimmy   1
Jimmy   1

I am trying to create an output CSV that totals the books by author, i.e. Vincent 4, Thomas 3 and Jimmy 2.

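With the code below I have managed to reach an intermediate stage where I get a running total for each author. The line print line[0], countAuthor produces the following, which is fine: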

Vincent 1
Vincent 2
Vincent 3
Vincent 4
Thomas  1
Thomas  2
Thomas  3
Jimmy   1
Jimmy   2

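I then plan to turn this output into a list, sort it in descending order, and keep only the record with the highest value for each author (that is, skip a record when its author is the same as the previous one). I will then write the result to another CSV file.

My problem is that I cannot get the author and the running total into a list. I can get them into the variable w: print w[2] works, but print data[2] does not, because data seems to have only one column. Any help would be greatly appreciated, as I have spent nearly two days on this without much luck. I am forced to use csv because the full file has blank author names, etc.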

with open("testingtesting6a.csv") as inf:
    data = []
    author = 'XXXXXXXX'
    countAuthor = 0.0
    for line in inf:
        line = line.split(",")
        if line[0] == author:
            countAuthor = countAuthor + float(line[1])
        else:
            countAuthor = float(line[1])
            author = line[0]

        # print line[0], countAuthor

        w = (line[0],line[1],countAuthor)
        print w[2]
        data.append(w)
        print data[2]
        # print data[0]
        # print type(w)
        # print w[2]
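
The symptom described above can be reproduced in isolation. This is a minimal sketch, assuming rows are appended as 3-tuples like w in the code above; the sample values are illustrative. data[2] asks for the third row of the list, not the third field of the newest row, so it raises IndexError until three rows have been appended. The most recent running total is data[-1][2].

```python
data = []
w = ("Vincent", "1", 1.0)  # (author, raw count, running total), shaped like w above
data.append(w)

print(len(data))     # the list holds one row so far
print(data[-1][2])   # third field of the most recent row

error = None
try:
    data[2]          # third *row* of the list, which does not exist yet
except IndexError as exc:
    error = exc
    print("IndexError:", exc)
```

So data does hold all three columns; it is only the index that selects a row rather than a field.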

2 Answers:

Answer 0: (score: 0)

The standard library has this covered.

import collections

def sum_up(input_file):
    counter = collections.defaultdict(int)
    for line in input_file:
        parts = line.split()  # splits by any whitespace.
        if len(parts) != 2:
        continue  # skip the line that does not parse; maybe a blank line.
        name, number = parts
        counter[name] += int(number)  # you can't borrow 1.25 books.
    return counter

Now you can:

with open('...') as f:
  counts = sum_up(f)

for name, count in sorted(counts.items()):
  print name, count  # prints counts sorted by name.

print counts['Vincent']  # prints 4.

print counts['Jane']  # prints 0.

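The trick here is to use defaultdict, a dictionary that pretends to have a value for any key. We ask it to use the default value produced by int(), which is 0.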
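
To see the defaultdict behaviour on its own, here is a small sketch (the names are just sample data). One detail worth knowing: looking up a missing key with square brackets also inserts that key with the default value.

```python
import collections

counter = collections.defaultdict(int)
counter["Vincent"] += 1   # no KeyError: a missing key starts at int() == 0
counter["Vincent"] += 1

print(counter["Vincent"])  # 2
print(counter["Jane"])     # 0, and this lookup also inserts "Jane"
print(sorted(counter))     # ['Jane', 'Vincent']
```

If you want to check for a name without adding it, use "Jane" in counter or counter.get("Jane") instead.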

Answer 1: (score: 0)

Use strip to remove the whitespace and Pandas to do the grouping:

Input file (the additional spaces are intentional):

author,books
Vincent, 1
Vincent , 1
Vincent, 1
Vincent, 1
Thomas  ,  1
Thomas,  1
Thomas,  1
Jimmy,   1
Jimmy  ,   1

import csv
import pandas as pd

# open the file in a with block so it is closed once the rows are read
with open('author.csv', 'r') as fin:
    reader = csv.DictReader(fin, delimiter=',')

    # strip removes the surrounding spaces
    authors = [(d['author'].strip(), int(d['books'].strip())) for d in reader]

df = pd.DataFrame(authors)
df.columns = ['author', 'books']
df2 = (df.groupby('author').sum())
print (df2)    

         books
author        
Jimmy        2
Thomas       3
Vincent      4

# For total of books:
print (df2.books.sum())
9
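
If Pandas is not available, the same strip-and-group step can be sketched with the standard library, together with the descending sort and CSV output the question asks for. This is only a sketch under a few assumptions: it is written for Python 3 (on 2.7 the in-memory text buffers would need adjusting), total_by_author is a hypothetical helper name, rows with a blank author or a non-numeric count are skipped (which also drops the header line), and io.StringIO stands in for the real input and output files.

```python
import collections
import csv
import io

def total_by_author(rows):
    """Sum the second column per author; skip rows that do not parse."""
    totals = collections.defaultdict(int)
    for row in rows:
        if len(row) != 2:
            continue  # blank line or malformed row
        author, books = row[0].strip(), row[1].strip()
        if not author or not books.isdigit():
            continue  # assumption: skip blank authors, the header, bad counts
        totals[author] += int(books)
    return totals

# In-memory stand-in for the input file, including one blank-author row.
sample = io.StringIO(
    "author,books\nVincent, 1\nVincent , 1\nVincent, 1\nVincent, 1\n"
    "Thomas  ,  1\nThomas,  1\nThomas,  1\nJimmy,   1\n,1\nJimmy  ,   1\n"
)
totals = total_by_author(csv.reader(sample))

# Sort by total, largest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

out = io.StringIO()  # stand-in for an output file opened for writing
writer = csv.writer(out)
writer.writerow(["author", "books"])
writer.writerows(ranked)
print(out.getvalue())
```

Because the totals are summed per author directly, there is no need for the intermediate running-total list and the "keep only the highest record" pass described in the question.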