Question

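I have a text file that contains one timestamp per line. My goal is to find the time range. The times are all in order, so the first line is the earliest time and the last line is the latest. I only need the very first and very last lines. What is the most efficient way to get these lines in Python?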

Note: these files are relatively long, roughly 10,000 to 20,000 lines each, and I have to do this for several hundred files.

Answer 1

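You can open the file for reading and read the first line with the built-in readline(), then seek to the end of the file and step backwards until you find the EOL of the preceding line, and read the last line from there.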

import os

with open(file, "rb") as f:
    first = f.readline()        # Read the first line.
    f.seek(-2, os.SEEK_END)     # Jump to the second last byte.
    while f.read(1) != b"\n":   # Until EOL is found...
        f.seek(-2, os.SEEK_CUR) # ...jump back the read byte plus one more.
    last = f.readline()         # Read last line.

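Jumping to the second-to-last byte instead of the last one prevents you from returning immediately because of a trailing EOL. While stepping backwards you also have to move two bytes at a time, because reading and checking for EOL pushes the position forward by one.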

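When using seek, the signature is seek(offset, whence=0), where whence indicates what the offset is relative to. Quoting docs.python.org:

SEEK_SET or 0 = seek from the start of the stream (the default); offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.

SEEK_CUR or 1 = "seek" to the current position; offset must be zero, which is a no-operation (all other values are unsupported).

SEEK_END or 2 = seek to the end of the stream; offset must be zero (all other values are unsupported).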
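The restrictions quoted above come from the text-stream documentation (TextIOBase), which is why the snippet opens the file with "rb". A minimal sketch of the difference follows; the helper name can_seek_back_from_end and the sample file contents are made up for illustration.

```python
import io
import os
import tempfile


def can_seek_back_from_end(path, mode):
    """Return True if a negative end-relative seek works in the given mode."""
    with open(path, mode) as f:
        try:
            f.seek(-2, os.SEEK_END)
        except (io.UnsupportedOperation, OSError):
            return False
        return True


# Build a small two-line sample file to try both modes on.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"2016-05-31 10:00:00\n2016-05-31 11:00:00\n")
    sample_path = tmp.name

binary_ok = can_seek_back_from_end(sample_path, "rb")  # binary mode
text_ok = can_seek_back_from_end(sample_path, "r")     # text mode
os.remove(sample_path)
```

On Python 3 the binary-mode seek should succeed and the text-mode one should be rejected, so the backwards scan only works on a file opened in binary mode.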

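Running this 10k times on a 6k-line file totalling 200 kB took 1.62 s, versus 6.92 s for the for-loop shown below, which was suggested earlier. With a 1.3 GB file, still 6k lines, one hundred runs gave 8.93 versus 86.95.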

with open(file, "rb") as f:
    first = f.readline()     # Read the first line.
    for last in f: pass      # Loop through the whole file reading it all.
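For anyone who wants to repeat the comparison on their own data, here is a rough timing sketch. It assumes the standard timeit module and a synthetic 6000-line file; the function names, the file contents, and the run count are illustrative, and the numbers it prints will not match the figures quoted above.

```python
import os
import tempfile
import timeit


def last_by_seek(path):
    """Last line via backwards seeking (needs at least two lines)."""
    with open(path, "rb") as f:
        f.readline()
        f.seek(-2, os.SEEK_END)
        while f.read(1) != b"\n":
            f.seek(-2, os.SEEK_CUR)
        return f.readline()


def last_by_loop(path):
    """Last line via iterating over the whole file."""
    with open(path, "rb") as f:
        last = f.readline()
        for last in f:
            pass
        return last


# Synthetic file: 6000 timestamp-like lines (an assumption, not the original data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    for i in range(6000):
        tmp.write(b"2016-05-31 10:00:00 line %d\n" % i)
    bench_path = tmp.name

seek_seconds = timeit.timeit(lambda: last_by_seek(bench_path), number=200)
loop_seconds = timeit.timeit(lambda: last_by_loop(bench_path), number=200)
print("seek: %.3fs  loop: %.3fs" % (seek_seconds, loop_seconds))

same_result = last_by_seek(bench_path) == last_by_loop(bench_path)
os.remove(bench_path)
```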

Answer 2

docs for io module

with open(fname, 'rb') as fh:
    first = next(fh).decode()

    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

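The value 1024 here stands for the average line length. I picked 1024 only as an example. If you have an estimate of the average line length, you can use that value times 2.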

Since you have no idea about the possible upper bound on the line length, the obvious solution is to loop over the file:

for line in fh:
    pass
last = line

You don't need the binary flag for this; you can just use open(fname).

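ETA: since you have many files to process, you could use random.sample to pick a sample of a few files and run this code on them to determine the length of the last line, starting with an a priori large value for the position offset (say 1 MB). That will help you estimate the value for the full run.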
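The sampling idea could be sketched roughly as follows. The function names, the default sample size of 20, and the 1 MB look-back window are assumptions for illustration, not part of the answer.

```python
import os
import random


def last_line_length(path, window=1024 * 1024):
    """Length in bytes of the last line, looking at most `window` bytes back."""
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)
        fh.seek(-min(window, size), os.SEEK_END)
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0


def estimate_offset(paths, sample_size=20, seed=None):
    """Suggest a seek offset: twice the longest last line in a random sample."""
    paths = list(paths)
    rng = random.Random(seed)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    longest = max((last_line_length(p) for p in sample), default=0)
    return 2 * longest
```

The returned value can then be used in place of 1024 in the seek call, negated, for the full run. A file whose last line is longer than anything in the sample would still be read incorrectly, so this is only an estimate.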

Answer 3

Here is a modified version of SilentGhost's answer that will do what you want.

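with open(fname, 'rb') as fh:
    first = next(fh)
    size = fh.seek(0, 2)    # File size in bytes (Python 3: seek returns the position).
    offs = 100
    while True:
        # Never seek back past the start of the file.
        fh.seek(-min(offs, size), 2)
        lines = fh.readlines()
        # More than one line in the window means the last one is complete;
        # so does a window that already covers the whole file.
        if len(lines) > 1 or offs >= size:
            last = lines[-1]
            break
        offs *= 2
    print(first)
    print(last)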

No upper bound on the line length is needed here.

Answer 4

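Can you use unix commands? I think head -1 and tail -n 1 are probably the most efficient approach. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] for the last, but that may take too much memory.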
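If shelling out is not an option and readlines() is too memory-hungry, one alternative (not part of this answer) is collections.deque with maxlen=1. It still reads the whole file, but keeps only one line in memory at a time. The function name first_and_last is made up for this sketch.

```python
from collections import deque


def first_and_last(path):
    """Return (first, last) lines as bytes; both are b'' for an empty file."""
    with open(path, "rb") as fid:
        first = fid.readline()
        # deque(maxlen=1) consumes the iterator but keeps only the final item.
        tail = deque(fid, maxlen=1)
    # A one-line file leaves the deque empty, so the first line is also the last.
    last = tail[0] if tail else first
    return first, last
```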

Answer 5

Here is my solution, which is also compatible with Python 3. It handles the edge cases too, but it lacks UTF-16 support:

import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)

                    char = f.read(1)

                    if char == b"\n":
                        break
                else:
                    # No newline found: the file is a single line, so rewind.
                    f.seek(0)

        return f.readline()

It was inspired by Trasp's answer and AnotherParker's comment.

Answer 6

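First open the file in read mode, then use the readlines() method to read it line by line. All the lines are stored in a list, and you can then index into the list to get the first and last lines of the file.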

with open('file.txt', 'rb') as a:
    lines = a.readlines()    # Reads the whole file into memory.
if lines:
    first_line = lines[0]    # lines[:1] would give a one-element list instead.
    last_line = lines[-1]

Answer 7

with open('file.txt', 'r') as w:
    first = w.readline()
    print('first line is : ', first)
    x = first    # Fallback if the file has only one line.
    for line in w:
        x = line
    print('last line is : ', x)

The for loop iterates over the lines, and x ends up holding the line from the final iteration, which is the last line.

Answer 8

with open("myfile.txt") as f:
    lines = f.readlines()    # Reads the whole file into memory.
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)

Answer 9

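This is an extension of @Trasp's answer with extra logic to handle the corner case of a file that has only one line. Handling this case can be useful if you repeatedly want to read the last line of a file that is continuously being updated. Without it, trying to grab the last line of a file that has just been created and contains only one line raises IOError: [Errno 22] Invalid argument.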

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last

Answer 10

Nobody has mentioned using reversed:

with open(file, "r") as f:
    r = reversed(f.readlines())    # Still reads the whole file into memory.
last_line_of_file = next(r)        # r.next() only exists in Python 2.

Answer 11

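Getting the first line is trivially easy. For the last line, assuming you know an approximate upper bound on the line length, os.lseek back some amount from SEEK_END, find the end of the second-to-last line, and then readline() the last line.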
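A minimal sketch of this idea, assuming no line is longer than max_line_len bytes including its EOL. The function name and the default of 1024 are illustrative, and the sketch uses os.read plus rfind on the chunk rather than readline(), since it works on a raw file descriptor.

```python
import os


def last_line_lseek(path, max_line_len=1024):
    """Last line of `path`, assuming no line exceeds `max_line_len` bytes."""
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.fstat(fd).st_size
        # Look back far enough to cover the last line plus the EOL before it.
        window = min(size, max_line_len + 1)
        os.lseek(fd, -window, os.SEEK_END)
        chunk = os.read(fd, window)
    finally:
        os.close(fd)
    # Ignore a trailing EOL, then cut just after the previous one (if any).
    body = chunk[:-1] if chunk.endswith(b"\n") else chunk
    return chunk[body.rfind(b"\n") + 1:]
```

If the last line is longer than the assumed bound, the result is silently truncated, so the bound should be chosen generously.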

Answer 12

def last_line(filename):
    with open(filename, "rb") as f:  # Binary mode: needed for relative seeks.
        first = f.readline()
        if f.read(1) == b'':         # Nothing after the first line.
            return first
        f.seek(-2, 2)  # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)  # ...jump back the read byte plus one more.
        last = f.readline()  # Read last line.
        return last

The code above is a modified version of the earlier answers that handles the case where the file contains only one line.

What is the most efficient way to get the first and last line of a text file?
