Question

I have a CSV that looks like this:

HA-MASTER,CategoryID
38231-S04-A00,14
39790-S10-A03,14
38231-S04-A00,15
39790-S10-A03,15
38231-S04-A00,16
39790-S10-A03,16
38231-S04-A00,17
39790-S10-A03,17
38231-S04-A00,18
39790-S10-A03,18
38231-S04-A00,19
39795-ST7-000,75
57019-SN7-000,75
38251-SV4-911,75
57119-SN7-003,75
57017-SV4-A02,75
39795-ST7-000,76
57019-SN7-000,76
38251-SV4-911,76
57119-SN7-003,76
57017-SV4-A02,76

What I want to do is reformat this data so that there is only one row per CategoryID, for example:

14,38231-S04-A00,39790-S10-A03
76,39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02

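I haven't found a way to do this in Excel, so I'd like to do it programmatically. I have more than 100,000 rows. Is there a way to do something like this with Python's CSV reader and writer?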

Answer 1

This is very simple with a dictionary of lists (Python 2.7 solution):

#!/usr/bin/env python
import fileinput

categories={}
for line in fileinput.input():
    # Skip the first line in the file (assuming it is a header).
    if fileinput.isfirstline():
        continue

    # Split the input line into two fields.   
    ha_master, cat_id = line.strip().split(',')

    # If the given category id is NOT already in the dictionary,
    # add a new empty list.
    if cat_id not in categories:
        categories[cat_id] = []

    # Append a new value to the category.
    categories[cat_id].append(ha_master)

# Iterate over all category IDs and their lists.  Use ','.join() to
# build a comma-separated string from a Python list.
# (items() and print() with a single argument work on Python 2.7 and 3.)
for k, v in categories.items():
    print('%s,%s' % (k, ','.join(v)))

Answer 2

Yes, there is a way:

import csv

def addRowToDict(row):
    global myDict
    key=row[1]
    if key in myDict:
        #append values if entry already exists
        myDict[key].append(row[0])
    else:
        #create entry
        myDict[key]=[row[1],row[0]]


global myDict
myDict=dict()
inFile='C:/Users/xxx/Desktop/pythons/test.csv'
outFile='C:/Users/xxx/Desktop/pythons/testOut.csv'

with open(inFile, 'r') as f:
    reader = csv.reader(f)
    ignore=True
    for row in reader:
        if ignore:
            #ignore first row
            ignore=False
        else:
            #add entry to dict
            addRowToDict(row)


# newline='' (Python 3) stops the csv module from writing blank lines on Windows.
with open(outFile, 'w', newline='') as f:
    writer = csv.writer(f)
    #write everything to file
    writer.writerows(myDict.values())

Just edit inFile and outFile.

Answer 3

I would read in the whole file and build a dictionary where the keys are the IDs and the values are lists of the other data.

data = {}
with open("test.csv", "r") as f:
    for line in f:
        temp = line.rstrip().split(',')
        if len(temp[0].split('-')) == 3:  # => specific format that ignores the header...
            if temp[1] in data:
                data[temp[1]].append(temp[0])
            else:
                data[temp[1]] = [temp[0]]

with open("output.csv", "w+") as f:
    # items() works on both Python 2 and 3 (iteritems() is Python 2 only).
    for cat_id, values in data.items():
        f.write("{},{}\n".format(cat_id, ','.join(values)))

Answer 4

Use pandas!

import pandas
csv_data = pandas.read_csv('path/to/csv/file')

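# The pandas method is groupby(); collect each category's HA-MASTER values into a list.
use_this = csv_data.groupby('CategoryID')['HA-MASTER'].apply(list)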

You will get lists containing what you want; now you just have to format them.
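
The formatting step could look something like the sketch below. It assumes the grouped result has already been converted to a plain dict that maps each CategoryID to a list of HA-MASTER values; `format_groups` is a made-up helper name for illustration, not part of pandas.

```python
def format_groups(groups):
    """Turn {category_id: [ha_master, ...]} into 'id,value,value' lines.

    `groups` is assumed to be a plain dict; the helper name is hypothetical.
    """
    lines = []
    for category_id in sorted(groups):
        fields = [str(category_id)] + list(groups[category_id])
        lines.append(','.join(fields))
    return lines


# Sample data taken from the question.
groups = {
    14: ['38231-S04-A00', '39790-S10-A03'],
    19: ['38231-S04-A00'],
}
for line in format_groups(groups):
    print(line)
```

Each returned string is one output row, so writing them to a file is just a matter of adding a newline after each one.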

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Cheers.

Answer 5

I saw a lot of nice answers while I was working on this, but I'll post mine as well.

import re

csvIN = open('your csv file','r')
csvOUT = open('out.csv','w')

cat = dict()

for line in csvIN:
    line = line.rstrip()
    if not re.search('^[0-9]+',line): continue

    ham, cid = line.split(',')

    if cat.get(cid,False):
        cat[cid] = cat[cid] + ',' + ham
    else:
        cat[cid] = ham

for i in sorted(cat):
    csvOUT.write(i + ',' + cat[i] + '\n')

# Close both files so the output is flushed to disk.
csvIN.close()
csvOUT.close()

Answer 6

The pandas approach:

import pandas as pd

df = pd.read_csv('data.csv')

#new = df.groupby('CategoryID')['HA-MASTER'].apply(lambda row: '%s' % ','.join(row))
new = df.groupby('CategoryID')['HA-MASTER'].agg(','.join)

# header=False keeps the header row out of the file on recent pandas versions too.
new.to_csv('out.csv', header=False)

out.csv:

14,"38231-S04-A00,39790-S10-A03"
15,"38231-S04-A00,39790-S10-A03"
16,"38231-S04-A00,39790-S10-A03"
17,"38231-S04-A00,39790-S10-A03"
18,"38231-S04-A00,39790-S10-A03"
19,38231-S04-A00
75,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
76,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"

Answer 7

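This was an interesting problem. My solution appends each new item for a given key to a single string in the value, with a comma to separate the columns.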

with open('Input01.csv') as input_file:
    # Strip newlines and skip the header row.
    file_lines = [item.strip() for item in input_file.readlines()][1:]

set_vals = {}  # maps CategoryID -> comma-joined HA-MASTER values
for item in [i.split(',') for i in file_lines]:
    if item[1] in set_vals:
        set_vals[item[1]] = set_vals[item[1]] + ',' + item[0]
    else:
        set_vals[item[1]] = item[0]

with open('Results01.csv','w') as output_file:
    for i in sorted(set_vals.keys()):
        output_file.write('{},{}\n'.format(i, set_vals[i]))

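MaxU's implementation, using pandas, has good potential and looks very elegant, but all the values end up in a single cell because each string is double-quoted. For example, the row for code 18 (18,"38231-S04-A00,39790-S10-A03") puts both values in the second column.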
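
The quoting behaviour described here can be reproduced with the standard csv module alone. The sketch below is only an illustration (`to_csv_line` is a made-up helper): a field that already contains a comma gets quoted, while passing each value as its own list element does not trigger quoting.

```python
import csv
import io


def to_csv_line(row):
    """Write one row with csv.writer and return the resulting text line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator='\n').writerow(row)
    return buffer.getvalue()


# One pre-joined string: the embedded comma forces quoting, so a
# spreadsheet reads it as a single cell.
joined = to_csv_line(['18', '38231-S04-A00,39790-S10-A03'])

# One list element per value: nothing needs quoting, so each value
# lands in its own column.
separate = to_csv_line(['18', '38231-S04-A00', '39790-S10-A03'])

print(joined, end='')
print(separate, end='')
```

So if separate columns are wanted, the values should stay in a list until the moment they are written, rather than being joined into one string first.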

Answer 8

import csv
from collections import defaultdict

inpath = ''  # Path to input CSV
outpath = ''  # Path to output CSV

output = defaultdict(list)  # To hold {category: [serial_numbers]}

with open(inpath) as f:
    for row in csv.DictReader(f):
        output[row['CategoryID']].append(row['HA-MASTER'])

with open(outpath, 'w') as f:
    f.write('CategoryID,HA-MASTER\n')
    for category, serial_numbers in output.items():
        # Join the list so the values are written as plain columns,
        # not as the repr of a Python list.
        row = '%s,%s\n' % (category, ','.join(serial_numbers))
        f.write(row)
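
Since the question asks specifically about the CSV reader and writer, here is a minimal Python 3 sketch that uses both. The function name `regroup` and the in-memory sample are assumptions for illustration, and the numeric sort assumes every CategoryID is an integer.

```python
import csv
import io
from collections import defaultdict


def regroup(infile, outfile):
    """Read HA-MASTER,CategoryID rows and write one row per CategoryID."""
    reader = csv.reader(infile)
    next(reader, None)  # skip the header row
    groups = defaultdict(list)
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        ha_master, category_id = row
        groups[category_id].append(ha_master)

    writer = csv.writer(outfile, lineterminator='\n')
    # Sort numerically; this assumes every CategoryID is an integer.
    for category_id in sorted(groups, key=int):
        writer.writerow([category_id] + groups[category_id])


# Small in-memory demonstration using rows from the question.
source = io.StringIO(
    'HA-MASTER,CategoryID\n'
    '38231-S04-A00,14\n'
    '39790-S10-A03,14\n'
    '38231-S04-A00,19\n'
)
target = io.StringIO()
regroup(source, target)
print(target.getvalue(), end='')
```

With real files, both would be opened with newline='' as the csv module documentation recommends, for example open(inpath, newline='') and open(outpath, 'w', newline='').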
