Question

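I have a dataset in a CSV file that was downloaded from an accounting application. The data is well structured, except that it is split into pages. As a result, the CSV file contains unwanted junk lines: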

Company:ABC Ltd Date: 30-Mar-2017
GL Download

                                              Page No 1

GL Code,GL Name,Journal Id,Amount $,Vendor,Vendor Code,Text
1001200,SalesUK,5060400604,1,234.34,GroveT,234565,FC approved

More data here ......

Company:ABC Ltd Date: 30-Mar-2017
GL Download

                                              Page No 2

GL Code,GL Name,Journal Id,Amount $,Vendor,Vendor Code,Text
34560432,SalesUK,5060434567,4,356.19,Legend,135678,checked

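The header is repeated every time a page break is reached. I am trying to load the data from the CSV file into a pandas.DataFrame, but the problem is those page breaks and repeated headers, which I need to get rid of.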

Is there a standard solution in pandas or the Python csv module for skipping non-data lines such as those page numbers and headers?

Answer 1

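Pandas lets you pass in your own parser. If you pass the parameter engine='python', then filepath_or_buffer (the first argument) is expected to be an iterator that returns lists, which is the same thing the csv module produces. This relies on how the python engine behaved in the pandas versions available at the time; more recent releases may insist on a path or a file-like object. So you can supply a generator that fits this interface, like: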
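As a small illustration of the "iterator that returns lists" idea: csv.reader accepts any iterable of strings, not only an open file, and yields each row as a list of strings. The generator name and its two sample lines below are made up for this sketch.

```python
import csv

def sample_lines():
    # Hypothetical stand-in for a generator that emits only the wanted lines.
    yield 'GL Code,GL Name'
    yield '1001200,SalesUK'

# csv.reader consumes the generator lazily and splits each line into fields.
rows = list(csv.reader(sample_lines()))
print(rows)
```

Because the reader only needs an iterable of lines, any filtering can happen in the generator before the csv module ever sees the text.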

Code:

This code provides a custom parser for the report format in question. It yields only the data lines.

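def my_csv_reader(csvfile_handle):
    """Yield only the data lines of the paged report."""
    looking_for_header = True
    prev_line_blank = False
    for line in (x.strip() for x in csvfile_handle):
        blank_line = len(line) == 0
        if looking_for_header:
            # The column header is a non-blank line that follows a blank
            # line and starts with 'GL Code,'; data rows come right after it.
            if not blank_line and prev_line_blank:
                looking_for_header = not line.startswith('GL Code,')
        elif not blank_line:
            yield line
        else:
            # A blank line ends the data block of the current page.
            looking_for_header = True

        prev_line_blank = blank_line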
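A quick way to check the generator is to run it against an in-memory copy of the sample report from the question. This is only a sketch: the generator is repeated so the snippet runs on its own, and the variable names report_text, data_lines and field_counts are chosen here for illustration.

```python
import csv
import io

def my_csv_reader(csvfile_handle):
    looking_for_header = True
    prev_line_blank = False
    for line in (x.strip() for x in csvfile_handle):
        blank_line = len(line) == 0
        if looking_for_header:
            if not blank_line and prev_line_blank:
                looking_for_header = not line.startswith('GL Code,')
        elif not blank_line:
            yield line
        else:
            looking_for_header = True
        prev_line_blank = blank_line

# Two pages of the report, as shown in the question.
header = 'GL Code,GL Name,Journal Id,Amount $,Vendor,Vendor Code,Text'
banner = ['Company:ABC Ltd Date: 30-Mar-2017', 'GL Download', '']
report_text = '\n'.join(
    banner + ['    Page No 1', '', header,
              '1001200,SalesUK,5060400604,1,234.34,GroveT,234565,FC approved', '']
    + banner + ['    Page No 2', '', header,
                '34560432,SalesUK,5060434567,4,356.19,Legend,135678,checked']
)

# io.StringIO behaves like an open text file, so it can be iterated by line.
data_lines = list(my_csv_reader(io.StringIO(report_text)))
print(data_lines)

# Count the fields csv.reader sees in each surviving line.
field_counts = [len(row) for row in csv.reader(data_lines)]
print(field_counts)
```

Only the two data rows survive. Each of them splits into eight fields, because the comma inside the amount is treated as a separator, while the report header names only seven columns.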

Code using the parser:

To use the parser, we open the file, wrap our generator over that file in a csv.reader, and then call pandas.read_csv():

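import csv
import pandas as pd

with open('report.csv', 'r') as csvfile:
    reader = csv.reader(my_csv_reader(csvfile))
    # The data rows hold eight fields while the report header names seven,
    # hence the extra 'Category' name.
    df = pd.read_csv(
        reader, engine='python', header=None, index_col=False,
        names='GL Code,GL Name,Journal Id,Category,Amount $' \
              ',Vendor,Vendor Code,Text'.split(',')
    )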

Sample result:

    GL Code  GL Name  Journal Id  Category  Amount $  Vendor  Vendor Code  \
0   1001200  SalesUK  5060400604         1    234.34  GroveT       234565   
1  34560432  SalesUK  5060434567         4    356.19  Legend       135678   

          Text  
0  FC approved  
1      checked
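If the installed pandas does not accept a csv.reader object as its first argument, one possible variation is to write the kept lines into an in-memory io.StringIO buffer, which is file-like. This is a sketch under assumptions and not part of the answer above: the helper name data_lines_buffer and the test "first field is all digits" are hypothetical choices for illustration.

```python
import io

def data_lines_buffer(lines):
    """Return a file-like buffer holding only lines whose first field is numeric."""
    kept = []
    for line in lines:
        line = line.strip()
        first_field = line.split(',', 1)[0]
        # Assumption: every data row starts with an all-digit GL Code,
        # while banners, column headers and blank lines do not.
        if first_field.isdigit():
            kept.append(line)
    return io.StringIO('\n'.join(kept) + '\n')

# One page of the sample report from the question.
sample = [
    'Company:ABC Ltd Date: 30-Mar-2017',
    'GL Download',
    '',
    '    Page No 1',
    '',
    'GL Code,GL Name,Journal Id,Amount $,Vendor,Vendor Code,Text',
    '1001200,SalesUK,5060400604,1,234.34,GroveT,234565,FC approved',
]

buffer = data_lines_buffer(sample)
print(buffer.getvalue())
```

The returned buffer can then be handed to pd.read_csv(buffer, header=None, names=...) in place of the reader, with the same list of column names as above.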
