Enforcing a required-columns constraint in pandas read_csv

Time: 2017-03-22 17:51:46

Tags: python csv pandas dataframe

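I want to read a large CSV into a DataFrame with the added constraint that it should fail early if certain columns are missing (because the input is not as expected). I do, however, want all of the columns included in the DataFrame, not just the required ones. In pandas.read_csv, it seems I can use the usecols argument if I want to specify a subset of columns to read in, but the only obvious way I can see to check which columns will be in the DataFrame I am about to read is to actually read the file.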

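I have created a working first-pass version that reads the DataFrame as an iterator, gets the first row, checks that the columns exist, and then reads the file with the normal arguments: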

import pandas as pd
from io import StringIO

class MissingColumnsError(ValueError):
    pass

def cols_enforced_reader(*args, cols_must_exist=None, **kwargs):
    # Accept the input either positionally or as the filepath_or_buffer keyword,
    # so it is defined even when cols_must_exist is None
    if args:
        filepath_or_buffer, args = args[0], args[1:]
    else:
        filepath_or_buffer = kwargs.pop('filepath_or_buffer', None)

    if cols_must_exist is not None:
        # Read the first line of the DataFrame and check the columns
        new_kwargs = kwargs.copy()
        new_kwargs['iterator'] = True
        new_kwargs['chunksize'] = 1

        df_iterator = pd.read_csv(filepath_or_buffer, *args, **new_kwargs)

        c = next(df_iterator)
        if not all(col in c.columns for col in cols_must_exist):
            raise MissingColumnsError('Some required columns were missing!')

        # Rewind seekable buffers so the full read starts from the beginning
        seek = getattr(filepath_or_buffer, 'seek', None)
        if seek is not None:
            if filepath_or_buffer.seekable():
                filepath_or_buffer.seek(0)

    return pd.read_csv(filepath_or_buffer, *args, **kwargs)

in_csv = """col1,col2,col3\n0,1,2\n3,4,5\n6,7,8"""

# Should succeed
df = cols_enforced_reader(StringIO(in_csv), cols_must_exist=['col1'])
print('First call succeeded as expected.')

# Should fail
try:
    df = cols_enforced_reader(StringIO(in_csv), cols_must_exist=['col7'])
except MissingColumnsError:
    print('Second call failed as expected.')

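This feels a bit messy to me, and it does not really handle all possible inputs for filepath_or_buffer (for example, non-seekable streams, or buffers where I should not start from 0). Obviously I could adapt what I have here to my specific use case and be done with it, but I am wondering whether there is a more elegant way (preferably using only standard pandas functions) that works in general.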

1 Answer:

Answer 0: (score: 1)

Could you just read one row and test whether all the required columns are present? For example:

import pandas as pd

required_cols = ['col1', 'col2']
cols = pd.read_csv('input.csv', nrows=1).columns

if all(req in cols for req in required_cols):
    print(pd.read_csv('input.csv'))
else:
    print("Columns missing")
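
Since only the header is needed for the check, a variation on the above is to pass nrows=0, which makes pandas parse the header and return an empty DataFrame. The following is a minimal sketch, assuming an in-memory buffer as in the question; missing_columns is a hypothetical helper name, not a pandas function. It returns the names that are absent, which makes for a more informative error message than a plain pass/fail check.

```python
from io import StringIO

import pandas as pd


def missing_columns(buffer, required_cols, **read_csv_kwargs):
    """Return the required columns absent from the CSV header (hypothetical helper)."""
    # nrows=0 parses only the header row, yielding an empty DataFrame
    # whose .columns holds the column names.
    header = pd.read_csv(buffer, nrows=0, **read_csv_kwargs).columns
    return [col for col in required_cols if col not in header]


sample = StringIO("col1,col2,col3\n0,1,2\n3,4,5\n")
missing = missing_columns(sample, ["col1", "col7"])
print(missing)
```

Note that the buffer has been consumed by the check, so a seekable buffer still needs seek(0) (or a file path needs to be reopened) before the full read.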

To do this with a stream, an alternative is to read via csv.reader(), which is compatible with itertools.tee():

import pandas as pd
from itertools import tee
import csv

required_cols = ['col1', 'col2']

# newline='' is what the csv module recommends for files passed to csv.reader
with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    csv_stream1, csv_stream2 = tee(csv_input, 2)
    header = next(csv_stream1)

    if all(req in header for req in required_cols):
        df = pd.DataFrame(list(csv_stream2)[1:], columns=header)
        print(df)
    else:
        print("Columns missing")
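
For the non-seekable case raised in the question, another possibility is to consume only the header line from the stream, validate it, and then let pandas parse the remainder with the column names supplied explicitly. This is a sketch under assumptions, not a general solution: it assumes a text stream with a single-line header and the default comma delimiter, and read_csv_with_required is a hypothetical name. Unlike the csv.reader() version, pandas still infers the column dtypes here.

```python
import csv
from io import StringIO

import pandas as pd


def read_csv_with_required(stream, required_cols, **read_csv_kwargs):
    """Check the header of a text stream, then parse the rest without seeking."""
    # Consume just the header line. Assumption: the header fits on one line
    # and uses the default comma delimiter.
    header = next(csv.reader([stream.readline()]))
    missing = [col for col in required_cols if col not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # The stream is now positioned after the header, so tell pandas there is
    # no header row left and pass the column names explicitly.
    return pd.read_csv(stream, header=None, names=header, **read_csv_kwargs)


df = read_csv_with_required(
    StringIO("col1,col2,col3\n0,1,2\n3,4,5\n"), ["col1", "col2"]
)
print(df)
```

Because the stream is never rewound, the same approach should work for pipes or sockets wrapped as text streams, as long as the failure happens before any data rows are read.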