Reshaping CSV data: appending cells to the previous row, merging cells that contain a particular string

Time: 2013-07-09 20:05:07

Tags: python csv python-3.x

I have a file, data.csv, that looks like this (two columns, A and B):

A       B
01      a
        'b'
0101    a
        b
010101  a
        'b'
        'c'
        d
        'e'
        f
010102  a
        b
        'd'
        'e'
010201  a
        b
        'c'
        d

02      a
        b
0201    a
        b

020101  a
        b
        'd'
        'e'
020102  a
        'b'
        c
020201  a
        b
        c
        d
        'e'
020301  a
        'b'
        c
        d

I want it to look like this (five columns: A, B, C, D and E):

A       B   C   D   E
01      a   b       
0101    a   b       
010101  a   b   c   d, e, f
010102  a   b       d, e
010201  a   b   c   d
02      a           
0201    a   b       
020101  a   b       d, e
020102  a   b   c   
020201  a   b   c   d, e
020301  a   b   c   d

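This is what I know about data.csv:

  • UTF-8 encoding
  • UNIX-style line endings
  • tab delimiters
  • some rows are empty (empty cells)
  • some rows begin with an empty cell (a tab)
  • some rows begin with two, four or six digits
  • some cells contain a text string, represented here by a single character
  • some text strings are surrounded by ' signs
  • it cannot be assumed that 'a', 'b' and 'c' values are always present
  • there is no pattern to 'a', 'b' or 'c'
  • there is a pattern to 'd', 'e' and 'f' - the word foo is part of their strings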

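Treating data.csv as a text file, I put together a script that:

  • removes empty rows
  • appends rows that begin with a tab (an empty cell) to the previous row
  • removes the ' signs

Code: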

#!/usr/bin/python3
f = open('data.csv')
c = f.read()
f.close()
c = c.replace('\n\n', '\n')
c = c.replace('\n\t', '\t')
c = c.replace("'", "")
f = open('output.csv', 'w')
f.write(c)
f.close()

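...and then I got stuck. Perhaps this and the other adjustments could be done more uniformly with the csv module. How can I solve this with Python 3.3 (I assume any 3.x solution will be compatible)?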

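Update

Based on Martijn Pieters' answer, I came up with the following. It seems to work, although I am not sure that the 'a', 'b' and 'c' text values always end up in the right column. Also, the last row is skipped / left blank.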

#!/usr/bin/python3

import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    write_this_row = None
    for row in reader:
        # If there is a row with content...
        if row:
            # If the first cell has content...
            if row[0]:
                if write_this_row != None:
                    writer.writerow(write_this_row)
                write_this_row = row
            elif 'foo' in row[1]:
                if len(write_this_row) < 5:
                    write_this_row.extend([''] * (5 - len(row)))
                if write_this_row[4]:
                    write_this_row[4] += ';' + row[1]
                else:
                    write_this_row[4] = row[1]
            else:
                write_this_row.insert(3, row[1])
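
A note on the skipped last row: in the loop above, a pending row is only written when the next numbered row arrives, so the final one is never flushed. Below is a minimal sketch of the usual fix, which emits whatever is still pending once the loop has ended. `merge_rows` is a hypothetical helper name; it only groups cells (and strips the ' signs) and does not decide which column a value belongs in.

```python
def merge_rows(rows):
    """Yield one merged row per numbered row, followed by its continuation cells."""
    pending = None
    for row in rows:
        if not row:
            continue              # blank line: csv.reader returns []
        if row[0]:
            if pending is not None:
                yield pending     # a new numbered row starts: emit the previous one
            pending = list(row)
        elif pending is not None:
            # continuation line: append its cells, without the surrounding ' signs
            pending.extend(cell.strip("'") for cell in row[1:])
    if pending is not None:
        yield pending             # flush the last row after the loop


# Made-up sample rows in the shape csv.reader would return them
sample = [['01', 'a'], ['', "'b'"], [], ['02', 'a'], ['', 'b'], ['', "'foo d'"]]
merged = list(merge_rows(sample))
print(merged)
```

The same generator could be fed a `csv.reader` directly, since it only needs an iterable of lists.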

1 answer:

Answer 0: (score: 2)

Just use the csv module to read the data, massage it row by row, and write it out again.

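You can create "empty" columns by using None or an empty string '' as the value for that column. Conversely, reading an empty column (that is, the space between two consecutive tabs) gives you an empty string.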
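
To make the empty-column behaviour concrete, here is a small sketch that uses in-memory strings instead of files. The sample values are made up for illustration.

```python
import csv
import io

# Reading: two consecutive tabs produce an empty string for the cell between them.
rows = list(csv.reader(io.StringIO("01\ta\t\tfoo d\n"), delimiter='\t'))
print(rows)

# Writing: both None and '' come out as an empty cell.
out = io.StringIO()
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
writer.writerow(['01', 'a', None, '', 'foo d'])
print(repr(out.getvalue()))
```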

import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')

    for row in reader:
        if len(row) > 3:
            # detect if `c` is missing (insert your own test here)
            # sample test looks for 3 consecutive columns with values f, o and o
            if row[3:6] == ['f', 'o', 'o']:
                # insert an empty `c`
                row.insert(3, '')

        if len(row) < 5:
            # make row at least 5 columns long
            row.extend([''] * (5 - len(row)))
        if len(row) > 5:
            # merge any excess columns into the 5th column
            row[4] = ','.join(row[4:])
            del row[5:]

        writer.writerow(row)

Update

Instead of using a flag, use the reader as an iterator (calling next() on it to get the next row rather than using a for loop):

import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')

    row = None

    try:
        next(reader)  # skip the `A   B` headers.

        line = next(reader)  # prime our loop
        while True:
            while not line or not line[0]:
                # advance to the first line with a column 0 value,
                # skipping blank lines (which the reader returns as [])
                line = next(reader)

            row = line  # start off with the first number and column
            line = next(reader)  # prime the subsequent lines loop

            while line and not line[0]:
                # process subsequent lines until we find one with a value in col 0 again
                cell = line[1]
                if 'foo' in cell and len(row) < 4:
                    # a 'foo' value belongs in the 5th column:
                    # pad any missing columns with empty values first
                    row.extend([''] * (4 - len(row)))
                row.append(cell)
                line = next(reader)

            # consolidate, write
            if len(row) < 5:
                # make row at least 5 columns long
                row.extend([''] * (5 - len(row)))
            if len(row) > 5:
                # merge any excess columns into the 5th column
                row[4] = ','.join(row[4:])
                del row[5:]

            writer.writerow(row)
            row = None
    except StopIteration:
        # reader is done, no more lines to come
        # process the last row if there was one
        if row is not None:
            # consolidate, write
            if len(row) < 5:
                # make row at least 5 columns long
                row.extend([''] * (5 - len(row)))
            if len(row) > 5:
                # merge any excess columns into the 5th column
                row[4] = ','.join(row[4:])
                del row[5:]

            writer.writerow(row)
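
The consolidation step appears twice in the code above. One way to avoid the duplication, sketched here with a hypothetical helper name `consolidate`, is to move it into a function that both places call. The `width` parameter is an assumption added for illustration; the code above hard-codes 5.

```python
def consolidate(row, width=5):
    """Return a copy of row padded or merged to exactly `width` columns."""
    row = list(row)
    if len(row) < width:
        # pad short rows with empty cells
        row.extend([''] * (width - len(row)))
    elif len(row) > width:
        # merge any excess columns into the last column
        row[width - 1:] = [','.join(row[width - 1:])]
    return row


short_row = consolidate(['01', 'a', 'b'])
long_row = consolidate(['010101', 'a', 'b', 'c', 'd', 'e', 'f'])
print(short_row)
print(long_row)
```

Both `writer.writerow(row)` calls would then become `writer.writerow(consolidate(row))`.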