从csv中提取奇怪排列的数据并使用python转换为另一个csv文件

时间:2014-01-09 18:49:03

标签: python

我有一个奇怪的csv文件,其中包含带有标题值的数据及其相应的数据,如下所示:

,,,Completed Milling Job,,,,,, # row 1

,,,,Extended Report,,,,,

,,Job Spec numerical control,,,,,,,

Job Number,3456,,,,,, Operator Id,clipper,

Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,

Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,

我需要从这个结构中提取数据,按照以下结构创建另一个csv文件:

Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row

如果您注意到,添加了一个名为“status”的新标题列,但该值位于csv文件的第一行中。输出文件中的其余列名将从原始文件中提取。

任何想法都将不胜感激 - 谢谢

2 个答案:

答案 0 :(得分:0)

假设文件完全相同(至少在上限方面),这应该可行,但我只能保证你提供的确切数据:

#!/usr/bin/python
import glob
from sys import argv

g=open(argv[2],'w')
g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
for fname in glob.glob(argv[1]):
    with open(fname) as f:
        status=f.readline().strip().strip(',')
        f.readline()#extended report not needed
        f.readline()#job spec numerical control not needed
        s=f.readline()
        job_no=s.split('Job Number,')[1].split(',')[0]
        op_id=s.split('Operator Id,')[1].strip().strip(',')
        s=f.readline()
        machine_name=s.split('Coder Machine Name,')[1].split(',')[0]
        start_t=s.split('Job Start time,')[1].strip().strip(',')
        s=f.readline()
        machine_type=s.split('Machine type,')[1].split(',')[0]
        end_t=s.split('Job end time,')[1].strip().strip(',')
    g.write(",".join([status,job_no,machine_name,machine_type,op_id,start_t,end_t])+"\n")
g.close()

它需要一个glob参数(如Job*.data)和一个输出文件名,并且应该构建你需要的东西。只需将其保存为“so.py”或其他内容,然后将其作为python so.py <data_files_wildcarded> output.csv

运行

答案 1 :(得分:0)

这是一个解决方案,应该适用于任何与您显示的模式相同的CSV文件。这是一种非常讨厌的格式。

我对这个问题很感兴趣并在午休期间进行了研究。这是代码:

COMMA = ','
NEWLINE = '\n'

def _kvpairs_from_line(line):
    line = line.strip()
    values = [item.strip() for item in line.split(COMMA)]

    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values
            yield (values[i], values[i+1])
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines):
    """
    Given a series of lines, where each line is comma-separated values
    organized as key/value pairs like so:
        key_1,value_1,key_n+1,value_n+1,...
        key_2,value_2,key_n+2,value_n+2,...
        ...
        key_n,value_n,key_n+n,value_n+n,...

    Yield up key/value pairs taken from the first column, then from the second column
    and so on.
    """
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True

STATUS = "Status"
columns = [STATUS]

d = {}

with open("data.csv", "rt") as f:
    # get an iterator that lets us pull lines conveniently from file
    itr = iter(f)

    # pull first line and collect status
    line = next(itr)
    lst = line.split(COMMA)
    d[STATUS] = lst[3]

    # pull next lines and make sure the file is what we expected
    line = next(itr)
    assert "Extended Report" in line
    line = next(itr)
    assert "Job Spec numerical control" in line

    # pull all remaining lines and save in a list
    lines = [line.strip() for line in f]

for key, value in kvpairs_by_column_then_row(lines):
    columns.append(key)
    d[key] = value

with open("output.csv", "wt") as f:
    # write column headers line
    line = COMMA.join(columns)
    f.write(line + NEWLINE)
    # write data row
    line = COMMA.join(d[key] for key in columns)
    f.write(line + NEWLINE)