Question

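Hi, I'm using pandas to read a series of files and concatenate them into a DataFrame. My files start with a variable-length block of junk that I want to ignore. pd.read_csv() has a skiprows parameter. I've written a function that handles this case, but I have to open each file twice to make it work. Is there a better way?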

HEADER = '#Start'

def header_index(file_name):
    with open(file_name) as fp:
        for ind, line in enumerate(fp):
            if line.startswith(HEADER):
                return ind

for row in directories:
    path2file = '%s%s%s' % (path2data, row, suffix)
    myDF = pd.read_csv(path2file, skiprows=header_index(path2file), header=0, delimiter='\t')

Any help is greatly appreciated.

Answer 1

This is now possible (I don't know whether it was possible before) as follows:

with open(path2file) as fp:  # path2file as in the question
    pos = 0
    oldpos = None

    while pos != oldpos:  # make sure we stop reading, in case we reach EOF
        line = fp.readline()
        if line.startswith(HEADER):
            # set the read position to the start of the line
            # so pandas can read the header
            fp.seek(pos)
            break
        oldpos = pos
        pos = fp.tell()  # remember this position as the start of the next line

    pd.read_csv(fp, ...your options here...)
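
For anyone who wants to try the idea without a real file, here is a minimal, self-contained sketch of the same technique. The function name seek_to_header and the sample data are made up for illustration, and io.StringIO stands in for an open file; it records the position before each readline() so it can rewind to the start of the header line.

```python
import io

HEADER = '#Start'  # marker from the question


def seek_to_header(fp, marker=HEADER):
    """Position fp at the first line that starts with marker.

    Returns True if the marker was found; otherwise fp is left at EOF.
    (Hypothetical helper, not part of pandas.)
    """
    while True:
        pos = fp.tell()        # position of the line about to be read
        line = fp.readline()
        if not line:           # readline() returns '' at EOF
            return False
        if line.startswith(marker):
            fp.seek(pos)       # rewind to the start of the header line
            return True


# Made-up sample: two junk lines, then the header and one data row
sample = io.StringIO("junk 1\njunk 2\n#Start\tvalue\n1\t2\n")
found = seek_to_header(sample)
remaining = sample.read()      # everything from the header line onward
```

With a real file, the handle could be passed straight on instead of calling read(), for example pd.read_csv(fp, header=0, delimiter='\t') right after seek_to_header(fp).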

Answer 2

Since read_csv() also accepts a file-like object, you can skip the junk lines at the beginning before passing that object, rather than passing the file name.

Example:

Replace

df = pd.read_csv(filename, skiprows=no_junk_lines(filename), ...)

with:

def forward_csv(f, prefix):
    pos = 0
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)
            return f
        pos += len(line.encode('utf-8'))

df = pd.read_csv(forward_csv(open(filename), HEADER), ...)

Notes:

readline() returns an empty string when it reaches EOF
Not calling tell() to track the position saves some lseek system calls
The last line of forward_csv() assumes your input file is encoded in ASCII or UTF-8; if it is not, you will have to adjust this line
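
One possible way around the encoding caveat is to open the file in binary mode, so that len(line) is already a byte count. The sketch below is a variation on forward_csv, not the answer's original code: the name forward_csv_bytes and the sample data are invented for illustration, and it assumes the marker is plain ASCII and the file uses an ASCII-compatible encoding (so it would not suit UTF-16, for example).

```python
import io


def forward_csv_bytes(f, prefix):
    """Binary-mode variant: f must yield bytes, and prefix must be bytes.

    (Hypothetical helper; assumes an ASCII-compatible file encoding.)
    """
    pos = 0
    while True:
        line = f.readline()
        if not line or line.startswith(prefix):
            f.seek(pos)        # rewind to the start of the header line (or EOF)
            return f
        pos += len(line)       # length in bytes, no encoding assumption needed


# Made-up Latin-1 sample with Windows line endings, standing in for open(filename, 'rb')
raw = "préambule\r\nmore junk\r\n#Start\tx\r\n1\t2\r\n".encode("latin-1")
rest = forward_csv_bytes(io.BytesIO(raw), b"#Start").read()
```

As far as I know, read_csv() also accepts binary file objects, in which case the encoding argument (for example encoding='latin-1') tells pandas how to decode the data.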

Python pandas: reading a CSV file with a variable-length preamble
