Question

Is it possible to tell pandas to ignore columns that go beyond the size of the header?

import pandas

with open('test.csv', mode='w') as csv_file:
    csv_file.write("datetime,A\n")
    csv_file.write("2018-10-09 18:00:07, 123\n")

df = pandas.read_csv('test.csv')
print(df)

gives:

              datetime    A
0  2018-10-09 18:00:07  123

However, the CSV file I am loading contains more data columns than the header defines:

with open('test.csv', mode='w') as csv_file:
    csv_file.write("datetime,A\n")
    csv_file.write("2018-10-09 18:00:07, 123, ABC, XYZ\n")

df = pandas.read_csv('test.csv')
print(df)

returns:

                        datetime     A
2018-10-09 18:00:07 123      ABC   XYZ

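Pandas shifts the header names onto the rightmost data columns.

I need different behavior. I want pandas to ignore the data fields that go beyond the header.

Note: I cannot enumerate the columns, because this is a generic use case. For reasons outside my code, there is sometimes more data than expected, and I want to ignore the extra data.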

Answer 1

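It seems pandas notices that there are more data columns than header names, and assumes that the first two (data) columns are a (multi-)index.

Use the usecols parameter of read_csv to specify which data columns to read: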
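
To check that guess, a small sketch like the one below can inspect the index of the loaded frame. It reads the same two lines from an in-memory buffer instead of test.csv (the csv_text variable is only for illustration) and prints how many index levels pandas created and which column names it kept.

```python
import io

import pandas as pd

# Same content as test.csv in the question, held in memory for the demo.
csv_text = "datetime,A\n2018-10-09 18:00:07, 123, ABC, XYZ\n"

df = pd.read_csv(io.StringIO(csv_text))

# The two leading data fields end up in the index, not in the columns.
print(df.index.nlevels)
print(list(df.columns))
```

If the index has two levels, the header names were applied to the last two data fields, which matches the shifted output shown in the question.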

import pandas

with open('test.csv', mode='w') as csv_file:
    csv_file.write("datetime,A\n")
    csv_file.write("2018-10-09 18:00:07, 123, ABC, XYZ\n")

df = pandas.read_csv('test.csv', usecols=[0,1]) 
print(df)

yields:

              datetime    A
0  2018-10-09 18:00:07  123

Answer 2

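The code below shows a generic answer to the question: it counts the header fields first, so the columns do not have to be enumerated by hand.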

import pandas

with open('test.csv', mode='w') as csv_file:
    csv_file.write("datetime,A\n")
    csv_file.write("2018-10-09 18:00:07, 123, ABC, XYZ\n")

# Count the fields in the header and in the first data row.
with open("test.csv") as csv_file:
    headerCount = csv_file.readline().count(",") + 1
    dataCount = csv_file.readline().count(",") + 1

if headerCount < dataCount:
    print("Warning: Header and data size mismatch. Columns beyond header size will be removed.")

# Read only as many columns as the header names.
df = pandas.read_csv('test.csv', usecols=range(headerCount))
print(df)

produces:

Warning: Header and data size mismatch. Columns beyond header size will be removed.
              datetime    A
0  2018-10-09 18:00:07  123
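
One caveat with counting commas: a quoted header such as "A, raw" contains a comma that is not a separator. A possible variation, sketched below, lets the standard csv module split the header instead. The helper name read_csv_trim_to_header and the sample header are made up for this example, and it assumes the input is a seekable text buffer.

```python
import csv
import io

import pandas as pd


def read_csv_trim_to_header(buffer):
    """Hypothetical helper: keep only as many columns as the header names."""
    # csv.reader respects quoting, so a comma inside quotes is not counted.
    header = next(csv.reader(buffer))
    buffer.seek(0)  # rewind so pandas reads from the start again
    return pd.read_csv(buffer, usecols=list(range(len(header))))


# Illustrative data: the second header name contains a quoted comma.
sample = io.StringIO('datetime,"A, raw"\n2018-10-09 18:00:07, 123, ABC, XYZ\n')
df = read_csv_trim_to_header(sample)
print(df)
```

With a real file, the same idea works by opening the file once to read the header and then passing the path to read_csv.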

Answer 3

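To make the question more complete, here is the reverse case, where the header has more columns than the data. Use the following trick: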

import pandas

with open('test.csv', mode='w') as csv_file:
    csv_file.write("datetime,A, B, C\n")
    csv_file.write("2018-10-09 18:00:07, 123\n")

# Count the fields in the header and in the first data row.
with open("test.csv") as csv_file:
    headerCount = csv_file.readline().count(",") + 1
    dataCount = csv_file.readline().count(",") + 1

if headerCount != dataCount:
    print("Warning: Header and data size mismatch. Columns beyond header size will be removed.")

# Here the header is the longer line, so read only the columns that have data.
df = pandas.read_csv('test.csv', usecols=range(dataCount))
print(df)

This gives the correct pandas object:

Warning: Header and data size mismatch. Columns beyond header size will be removed.
              datetime    A
0  2018-10-09 18:00:07  123
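
The two cases (data longer than the header, header longer than the data) could probably be folded into one helper by reading only the columns that both lines have. The sketch below is one way to do that. The name read_csv_common_columns is hypothetical, it only looks at the first data row, and it assumes a seekable text buffer.

```python
import csv
import io

import pandas as pd


def read_csv_common_columns(buffer):
    """Hypothetical helper: read only columns present in header and first row."""
    rows = csv.reader(buffer)
    header = next(rows)
    # If there is no data row at all, fall back to the header width.
    first_data = next(rows, header)
    buffer.seek(0)  # rewind so pandas reads from the start again
    n_cols = min(len(header), len(first_data))
    return pd.read_csv(buffer, usecols=list(range(n_cols)))


# Data row longer than the header.
extra_data = io.StringIO("datetime,A\n2018-10-09 18:00:07, 123, ABC, XYZ\n")
# Header longer than the data row.
extra_header = io.StringIO("datetime,A, B, C\n2018-10-09 18:00:07, 123\n")

df_a = read_csv_common_columns(extra_data)
df_b = read_csv_common_columns(extra_header)
print(df_a)
print(df_b)
```

Whether dropping the empty header columns is desirable depends on the use case. Without usecols, pandas keeps them and fills them with NaN.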

Pandas: CSV header and data row size mismatch
