Question

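I have a question. Is there a way to tell whether a file has a column header, or how many rows to skip before reaching it? Say I have a set of files: one has a header on the first line, another has a header on the second line with some useless text on the first, and another has no header at all. I want to skip every line before the column header, or detect whether one exists, without specifying "skiprows" in the code. There are plenty of hard-coded ways to do this. I have used regular expressions, replacement and so on, but I am looking for a more general idea that covers all the bases. I even made a raw input prompt that lets you type in the number of rows to skip. That approach works, but I want something that does not rely on user input and simply detects the column header on its own. I am only looking for ideas, if there are any. I mostly work with csv-type files and would like to do this in Python.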

Answer 1

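csv.Sniffer has a has_header() function that should return True if the first row appears to be a header. A procedure for using it would be to first remove all blank rows from the top of the file, up to the first non-blank row, and then run csv.Sniffer.has_header(). My experience is that the header must be on the first line for has_header() to return True, and that it will return False if the number of header fields does not match the number of data fields for at least one row in its scan range, which is set by the user. 1024 or 2048 are typical scan ranges. I tried setting it much higher, even high enough that the whole file would be read, but it still failed to recognize the header when it was not on the first line. All my tests were done using Python 2.7.10.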
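The procedure described above (drop leading blank lines, then call has_header()) can be sketched in memory, without touching the file on disk. This is a minimal sketch written for Python 3 rather than the Python 2.7.10 mentioned above; the function name has_header_ignoring_blank_lines and the sample data are made up for illustration, and sample_size plays the role of the scan range.

```python
import csv


def has_header_ignoring_blank_lines(text, sample_size=2048):
    """Return (has_header, blank_lines_skipped) for CSV text.

    Hypothetical helper: drops leading blank lines, then asks
    csv.Sniffer whether the first remaining row looks like a header.
    """
    lines = text.splitlines(keepends=True)
    skipped = 0
    # Count blank (whitespace-only) lines at the top of the text.
    while skipped < len(lines) and not lines[skipped].strip():
        skipped += 1
    sample = "".join(lines[skipped:])[:sample_size]
    if not sample:
        return False, skipped
    try:
        return csv.Sniffer().has_header(sample), skipped
    except csv.Error:
        # The sniffer could not work out a delimiter for this sample.
        return False, skipped


# Illustrative data: two blank lines, then a header and three rows.
text = "\n\nname,age,score\nalice,30,88.5\nbob,25,92.0\ncarol,41,79.5\n"
print(csv.Sniffer().has_header(text))          # sniffer on the raw text
print(has_header_ignoring_blank_lines(text))   # sniffer after skipping blanks
```

The second value in the returned tuple is the number of lines that were skipped, which can be reused as the row offset when the file is parsed afterwards.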

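Here is an example of using csv.Sniffer in a script. It first determines whether the file has a recognizable header. If not, it renames the file, creates a new empty file with the original name, opens the renamed file for reading and the new file for writing, and writes the renamed file's contents to the new file, excluding leading blank lines. Finally, it retests the new file for a header to determine whether removing the blank lines made a difference.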

import csv
from datetime import datetime
import os
import re
import shutil
import sys
import time

common_delimeters = set(['\' \'', '\'\t\'', '\',\''])

def sniff(filepath):
    with open(filepath, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(2048))
        delimiter = repr(dialect.delimiter)
        if delimiter not in common_delimeters:
            print filepath,'has uncommon delimiter',delimiter
        else:
            print filepath,'has common delimiter',delimiter
        csvfile.seek(0)
        if csv.Sniffer().has_header(csvfile.read(2048)):
            print filepath, 'has a header'
            return True
        else:
            print filepath, 'does not have a header'
            return False

def remove_leading_blanks(filepath):
    # test filepath  for header and delimiter
    print 'testing',filepath,'with sniffer'
    has_header = sniff(filepath)
    if has_header:
        print 'no need to remove leading blank lines if any in',filepath
        return True
    # make copy of filepath appending current date-time to its name
    if os.path.isfile(filepath):
        now = datetime.now().strftime('%Y%d%m%H%M%S')
        m = re.search(r'(\.[A-Za-z0-9_]+)\Z',filepath)
        bakpath = ''
        if m != None:
            bakpath = filepath.replace(m.group(1),'') + '.' + now + m.group(1)
        else:
            bakpath = filepath + '.' + now       
        try:
            print 'renaming', filepath,'to', bakpath
            os.rename(filepath, bakpath)
        except:
            print 'renaming operation failed:', sys.exc_info()[0]
            return False
        print 'creating a new',filepath,'from',bakpath,'minus leading blank lines'
        # now open renamed file and copy it to original filename
        # except for leading blank lines
        time.sleep(2)
        try:
            with open(bakpath) as o, open (filepath, 'w') as n:
                p = False
                for line in o:
                    if p == False:
                        if line.rstrip():
                            n.write(line)
                            p = True
                        else:
                            continue
                    else:
                        n.write(line)
        except IOError as e:
            print 'file copy operation failed: %s' % e.strerror   
            return False
        print 'testing new',filepath,'with sniffer'       
        has_header = sniff(filepath)
        if has_header:
            print 'the header problem with',filepath,'has been fixed'
            return True
        else:
            print 'the header problem with',filepath,'has not been fixed'
            return False

Given this csv file, where the header is actually on line 11, after leading blank lines:

header,better,leader,fodder,blather,super
1,2,3,,,
4,5,6,7,8,9
3,4,5,6,7,
2,,,,,

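remove_leading_blanks() determined that the file did not have a header, then removed the leading blank lines and determined that it did have a header. Here is the trace of its console output: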

testing test1.csv with sniffer...
test1.csv has uncommon delimiter '\r'
test1.csv does not have a header
renaming test1.csv to test1.20153108142923.csv
creating a new test1.csv from test1.20153108142923.csv minus leading blank lines
testing new test1.csv with sniffer
test1.csv has common delimiter ','
test1.csv has a header
the header problem with test1.csv has been fixed
done ok

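While this might work much of the time, in general it does not appear to be dependable, because headers and their locations vary so much. Still, maybe it is better than nothing.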
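For files where the header sits below non-blank junk text rather than blank lines, one possible extension of the same idea is to try has_header() from successive starting lines and keep the first offset it accepts. The following is only a sketch in Python 3 under stated assumptions: find_header_line, max_skip and min_rows are hypothetical names, not part of the csv module, and the result inherits all of the sniffer's heuristics, so it can misfire in the same ways.

```python
import csv


def find_header_line(text, max_skip=10, min_rows=3, sample_size=2048):
    """Return the 0-based index of the first line that csv.Sniffer
    accepts as a header, or None if no candidate is accepted.

    Hypothetical helper; max_skip and min_rows are illustrative knobs.
    """
    lines = text.splitlines(keepends=True)
    sniffer = csv.Sniffer()
    for start in range(max_skip + 1):
        # Require a header plus a few data rows: with too little data
        # the sniffer has nothing to compare the first row against.
        if len(lines) - start < min_rows:
            break
        sample = "".join(lines[start:])[:sample_size]
        try:
            if sniffer.has_header(sample):
                return start
        except csv.Error:
            # No delimiter could be determined from this offset.
            continue
    return None


# Illustrative data: one line of useless text above the real header.
text = (
    "Exported by some tool\n"
    "name,age,score\n"
    "alice,30,88.5\n"
    "bob,25,92.0\n"
    "carol,41,79.5\n"
)
print(find_header_line(text))
```

The returned index is the number of lines to skip before the header, so it could stand in for a hand-typed skip count. A return value of None means no offset was accepted, which covers both header-less files and files the sniffer cannot make sense of.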

See csv.Sniffer, csv.py and _csv.c for more information. PyMOTW's csv – Comma-separated value files has a good tutorial review of the csv module, with details on dialects.

Check whether a header exists with Python pandas

1 answer: