I have an odd CSV file that holds header names together with their corresponding values, like this:
,,,Completed Milling Job,,,,,, # row 1
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
I need to extract the data from this layout and create another CSV file with the following structure:
Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row
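Notice that a new header column named "Status" has been added, and that its value comes from the first row of the original CSV file. The remaining column names in the output file are taken from the original file.
Any ideas would be greatly appreciated. Thanks.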
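Answer 0 (score: 0)
Assuming the files all look exactly like that (at least in terms of capitalization), this should work, though I can only vouch for the exact data you provided: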
#!/usr/bin/python
import glob
from sys import argv
g=open(argv[2],'w')
g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
for fname in glob.glob(argv[1]):
    with open(fname) as f:
        status=f.readline().strip().strip(',')
        f.readline()#extended report not needed
        f.readline()#job spec numerical control not needed
        s=f.readline()
        job_no=s.split('Job Number,')[1].split(',')[0]
        op_id=s.split('Operator Id,')[1].strip().strip(',')
        s=f.readline()
        machine_name=s.split('Coder Machine Name,')[1].split(',')[0]
        start_t=s.split('Job Start time,')[1].strip().strip(',')
        s=f.readline()
        machine_type=s.split('Machine type,')[1].split(',')[0]
        end_t=s.split('Job end time,')[1].strip().strip(',')
        g.write(",".join([status,job_no,machine_name,machine_type,op_id,start_t,end_t])+"\n")
g.close()
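It takes a glob pattern (such as Job*.data) and an output filename, and should build what you need. Just save it as "so.py" or similar and run it as python so.py <data_files_wildcarded> output.csv. On a Unix-like shell, quote the wildcard (for example 'Job*.data') so the pattern reaches the script unexpanded; the script does the globbing itself and treats the second argument as the output file.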
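For comparison, here is a minimal sketch of the same extraction using the standard csv module instead of string splitting. It is not the answerer's code: the function name extract_record is made up for illustration, and it assumes that every data line carries at most two key/value pairs and that no value is empty (empty cells are simply dropped). The sample is embedded as a string so the snippet runs on its own.

```python
import csv
import io


def extract_record(lines):
    """Return (header, record) lists for one report.

    Assumption: lines 1-3 are title lines, and each later line holds
    up to two key/value pairs with no empty values.
    """
    rows = [[cell.strip() for cell in row] for row in csv.reader(lines)]
    # The status is the first non-empty cell on the first line.
    status = next(cell for cell in rows[0] if cell)
    left, right = [], []
    for row in rows[3:]:
        cells = [cell for cell in row if cell]  # drop the empty padding cells
        if len(cells) >= 2:
            left.append((cells[0], cells[1]))
        if len(cells) >= 4:
            right.append((cells[2], cells[3]))
    # Left-hand pairs first, then right-hand pairs, as in the desired output.
    pairs = [("Status", status)] + left + right
    header = [key for key, _ in pairs]
    record = [value for _, value in pairs]
    return header, record


sample = """\
,,,Completed Milling Job,,,,,,
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
"""

header, record = extract_record(io.StringIO(sample))
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerow(record)
print(out.getvalue())
```

With real files, you would pass an open file object to extract_record in place of io.StringIO(sample) and write to a file opened with newline="".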
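Answer 1 (score: 0)
Here is a solution that should work for any CSV file that follows the same pattern as the one you showed. It is a really nasty format.
I got interested in the problem and worked on it during my lunch break. Here is the code: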
COMMA = ','
NEWLINE = '\n'

def _kvpairs_from_line(line):
    line = line.strip()
    values = [item.strip() for item in line.split(COMMA)]
    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values
            yield (values[i], values[i+1])
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines):
    """
    Given a series of lines, where each line is comma-separated values
    organized as key/value pairs like so:
        key_1,value_1,key_n+1,value_n+1,...
        key_2,value_2,key_n+2,value_n+2,...
        ...
        key_n,value_n,key_n+n,value_n+n,...
    Yield up key/value pairs taken from the first column, then from the second column
    and so on.
    """
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True

STATUS = "Status"
columns = [STATUS]
d = {}

with open("data.csv", "rt") as f:
    # get an iterator that lets us pull lines conveniently from file
    itr = iter(f)
    # pull first line and collect status
    line = next(itr)
    lst = line.split(COMMA)
    d[STATUS] = lst[3]
    # pull next lines and make sure the file is what we expected
    line = next(itr)
    assert "Extended Report" in line
    line = next(itr)
    assert "Job Spec numerical control" in line
    # pull all remaining lines and save in a list
    lines = [line.strip() for line in f]

for key, value in kvpairs_by_column_then_row(lines):
    columns.append(key)
    d[key] = value

with open("output.csv", "wt") as f:
    # write column headers line
    line = COMMA.join(columns)
    f.write(line + NEWLINE)
    # write data row
    line = COMMA.join(d[key] for key in columns)
    f.write(line + NEWLINE)
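The "first column, then second column" ordering that kvpairs_by_column_then_row produces can also be written with itertools.zip_longest. The sketch below is a variant for illustration, not the code above: the names pairs_from_line and by_column_then_row are hypothetical, and it additionally skips a trailing key that has no value cell after it, a case where indexing values[i+1] would raise IndexError.

```python
from itertools import zip_longest


def pairs_from_line(line):
    """Collect (key, value) pairs from one comma-separated line."""
    values = [item.strip() for item in line.split(",")]
    pairs = []
    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # skip empty padding cell
        elif i + 1 < len(values):
            pairs.append((values[i], values[i + 1]))
            i += 2  # skip past key and value
        else:
            break  # trailing key with no value cell: ignore it
    return pairs


def by_column_then_row(lines):
    """Yield the first pair of every line, then the second pair of every line, etc."""
    per_line = [pairs_from_line(line) for line in lines]
    for column in zip_longest(*per_line):
        for pair in column:
            if pair is not None:  # this line had fewer pairs than others
                yield pair


lines = [
    "Job Number,3456,,,,,, Operator Id,clipper,",
    "Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,",
    "Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,",
]
result = list(by_column_then_row(lines))
for key, value in result:
    print(key, "=", value)
```

On the sample lines this lists the three left-hand keys first (Job Number, Coder Machine Name, Machine type) and then the three right-hand keys, which matches the column order requested in the question.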