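Is there a Pythonic way to determine which rows in a CSV file contain headers and values and which rows contain garbage, and then get the header/value rows into a DataFrame?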
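I'm relatively new to Python and have been using it to read multiple CSVs exported from the data logs of scientific instruments. So far, when working with CSVs for other tasks, I have always defaulted to the pandas library. However, these CSV exports vary depending on the number of "tests" logged on each instrument.
The column headers and data structure are the same between instruments, but each test is separated by a "preamble" that can change. So I end up with a file that looks something like this (this example has two tests, but there could be any number of them):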
blah blah here's a test and
here's some information
you don't care about
even a little bit
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here's some more garbage
that's different than the last one
this should make
life interesting
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15
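If the preamble were a fixed length every time, I would just use the skiprows parameter, but the preamble is variable length and the number of rows in each test is variable length as well.
My end goal is to merge all the tests and end up with something like: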
header1, header2, header3
1, 2, 3
4, 5, 6
7, 8, 9
10, 11, 12
13, 14, 15
which I can then manipulate with pandas as usual.
I have tried the following to find the first row with my expected headers:
import csv
import pandas as pd

with open('my_file.csv', newline='') as input_file:
    for row_num, row in enumerate(csv.reader(input_file, delimiter=',')):
        # The csv module returns an empty list [] for a blank line,
        # so check len(row) > 0 so it doesn't error out
        # later when searching for a string
        if len(row) > 0:
            # There's probably a better way to find it, but I just convert
            # the list to a string then search for the expected header
            if "['header1', 'header2', 'header3']" in str(row):
                header_row = row_num

df = pd.read_csv('my_file.csv', skiprows=header_row, header=0)
print(df)
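This works if I only have one test, because it finds the single row that contains the headers. With several tests, though, the header_row variable is overwritten every time a header is found, so it ends up pointing at the last one. For the example above I get this output: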
   header1  header2  header3
0        7        8        9
1       10       11       12
2       13       14       15
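I'm lost on how to append each instance of the header/data set to a DataFrame before continuing on to search for the next instance.
Also, when dealing with a large number of files, opening each one once with the csv module and then again with pandas is probably not very efficient.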
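Answer 0 (score: 0)
This program might help. It is essentially a wrapper around the csv.reader() object that filters out the garbage and passes only the good rows through.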
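import csv

import pandas as pd


def ignore_comments(fp, start_fn, end_fn, keep_initial):
    """Yield only the rows between a start row and the next end row."""
    state = 'keep' if keep_initial else 'start'
    for line in fp:
        if state == 'start' and start_fn(line):
            # First header: keep it so it can name the columns
            state = 'keep'
            yield line
        elif state == 'keep':
            if end_fn(line):
                state = 'drop'
            else:
                yield line
        elif state == 'drop':
            # Later headers only switch output back on; they are not repeated
            if start_fn(line):
                state = 'keep'


if __name__ == "__main__":
    with open('x.in', newline='') as fp:
        reader = csv.reader(fp, skipinitialspace=True)
        rows = list(ignore_comments(
            reader,
            lambda x: x == ['header1', 'header2', 'header3'],
            lambda x: x == [],  # assumes a blank line ends each block of data
            False))
    # Recent pandas versions expect a path or file-like object in read_csv,
    # so build the frame from the header row and the data rows instead
    df = pd.DataFrame(rows[1:], columns=rows[0]).apply(pd.to_numeric)
    print(df)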
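The wrapper above relies on a blank line marking the end of each block of data. If the export has no such blank lines, as in the sample shown in the question, a variant that starts a new block at every header line and ends it at the first row with the wrong number of fields may be easier to adapt. This is only a minimal sketch under stated assumptions: the header is exactly header1, header2, header3, no preamble line happens to split into exactly three comma-separated fields, and all data values are numeric. The names read_tests and EXPECTED_HEADER are made up for this illustration.

```python
import csv
import io

import pandas as pd

# Assumed column names, taken from the sample in the question.
EXPECTED_HEADER = ["header1", "header2", "header3"]


def read_tests(fp, header=EXPECTED_HEADER):
    """Return one DataFrame holding the data rows of every test in fp."""
    blocks = []     # one list of rows per test
    current = None  # rows of the test being read; None while in a preamble
    for row in csv.reader(fp, skipinitialspace=True):
        if row == header:
            # A header line starts a new test.
            current = []
            blocks.append(current)
        elif current is not None and len(row) == len(header):
            current.append(row)
        else:
            # Blank line or preamble text: the current test (if any) is over.
            current = None
    if not blocks:
        return pd.DataFrame(columns=header)
    frames = [pd.DataFrame(rows, columns=header) for rows in blocks]
    # Assumes every data value is numeric; to_numeric raises otherwise.
    return pd.concat(frames, ignore_index=True).apply(pd.to_numeric)


sample = """blah blah here's a test and
you don't care about
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
life interesting
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15"""

df = read_tests(io.StringIO(sample))
print(df)
```

Each file is opened and parsed only once, and keeping one list of rows per test makes it straightforward to tag the rows with a test number later if that turns out to be useful.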
Answer 1 (score: 0)
Yes, there is a more Pythonic way to do this with pandas alone (this is a quick demonstration to answer the question):
import pandas as pd
from io import StringIO

# define an example to showcase the solution
st = """blah blah here's a test and
here's some information
you don't care about
even a little bit
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here's some more garbage
that's different than the last one
this should make
life interesting
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15"""

# 1- read the data with pd.read_csv
# 2- specify that you want to skip bad lines, on_bad_lines="skip"
#    (pandas 1.3 and later; older versions used error_bad_lines=False)
# 3- The header has to be the first row of the file. Since this is not the case, let's manually define it with names=[...] and header=None.
data = pd.read_csv(StringIO(st), delimiter=",", names=["header1", "header2", "header3"], on_bad_lines="skip", header=None)

# the trash will be loaded as follows
# blah blah here's a test and NaN NaN
# let's drop these rows
data = data.dropna()

# remove the repeated header rows, i.e. rows whose first field contains "header"
mask = data["header1"].str.contains("header")
data = data[~mask]

print(data)
Now your DataFrame looks like this:
   header1 header2 header3
5        1       2       3
6        4       5       6
13       7       8       9
14      10      11      12
15      13      14      15
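One thing to keep in mind with this approach: because the garbage and header lines were read into the same columns as the data, the remaining values are still strings, and the index keeps the original line positions (5, 6, 13, ...). A possible follow-up step is sketched below, assuming every remaining value is numeric. The helper name tidy is made up for this example, and the small frame is just a stand-in for the filtered data above.

```python
import pandas as pd


def tidy(data):
    """Convert string columns to numbers and renumber the rows from 0."""
    # Strip stray spaces left over from "1, 2, 3" style lines, then convert.
    numeric = data.apply(lambda col: pd.to_numeric(col.str.strip()))
    return numeric.reset_index(drop=True)


# Stand-in for the filtered frame: string values and gaps in the index.
data = pd.DataFrame(
    {
        "header1": ["1", "4", "7"],
        "header2": [" 2", " 5", " 8"],
        "header3": [" 3", " 6", " 9"],
    },
    index=[5, 6, 13],
)

clean = tidy(data)
print(clean.dtypes)
print(clean)
```

After this step the columns can be used in arithmetic and aggregations like any other numeric pandas data.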