Question

What is the fastest way to find the byte position of a specific line in a file from the command line?

For example:

$ linepos myfile.txt 13
5283

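I'm writing a parser for a CSV file that is several GB in size, and if the parser stops I want to be able to resume from the last position. The parser is in Python, but even iterating over file.readlines() takes a long time because the file has millions of lines. I'd like to just do file.seek(int(command.getoutput("linepos myfile.txt %i" % lastrow))), but I can't find a shell command that does this efficiently.

Edit: Sorry for the confusion, but I'm looking for a non-Python solution. I already know how to do this from Python.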

Answer 1

From @Chepner's comment on my other answer:

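position = 0  # or wherever you left off last time
try:
    # Binary mode, so seek()/tell() work in exact byte offsets.
    with open('myfile.txt', 'rb') as file:
        file.seek(position)  # zero in base case
        # Use readline() rather than "for line in file": the iterator's read-ahead
        # can make tell() unreliable or raise, depending on Python version and mode.
        for line in iter(file.readline, b''):
            position = file.tell()  # current seek position in file
            # process the line
except:
    print('exception occurred at position {}'.format(position))
    raise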
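
The seek/tell idea above can also be sketched as a small Python 3 generator. This is only an illustration, assuming the file is read in binary mode so that offsets are bytes; the name `lines_with_offsets` is hypothetical and not part of the original answer.

```python
import os
import tempfile


def lines_with_offsets(path, start=0):
    """Yield (byte_offset, line) pairs, beginning at byte offset `start`.

    Hypothetical helper: `byte_offset` is where the yielded line begins,
    so it is the value to store if processing of that line fails.
    """
    with open(path, 'rb') as f:      # binary mode: tell() is a byte offset
        f.seek(start)
        while True:
            offset = f.tell()        # start of the line about to be read
            line = f.readline()
            if not line:             # empty bytes object means end of file
                break
            yield offset, line


# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'a,b\n1,2\n3,4\n')
try:
    for offset, line in lines_with_offsets(tmp.name):
        print(offset, line)
    # Resuming from a saved offset re-reads only the remaining lines.
    print(list(lines_with_offsets(tmp.name, start=8)))
finally:
    os.remove(tmp.name)
```

Storing the offset of the line that was being processed, rather than the offset after it, means a resumed run retries the failed line instead of skipping it.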

Answer 2

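Iterating over a file object yields lines complete with their line endings. You should be able to get the position simply by adding len(line) to a counter. If the file is opened in text mode, len counts characters rather than bytes, so you would also need to account for the character encoding (bytes per character).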

position = 0  # or wherever you left off last time
try:
    # Binary mode, so len(line) counts bytes and seek() gets a byte offset.
    with open('myfile.txt', 'rb') as file:  # don't you go correcting me on naming it file. we don't call file directly anyway!
        file.seek(position)  # zero in base case
        for line in file:
            position += len(line)
            # process the line
except:
    # yes, a naked exception. TWO faux pas in one answer?!?
    print('exception occurred at position {}'.format(position))
    raise # re-raise to see traceback or what have you
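
As a rough illustration of the same counting approach, the sketch below computes the byte offset at which a given line starts. It assumes Python 3 and binary mode, so that len(line) is a byte count even for multi-byte encodings; `linepos` here is just the hypothetical command name from the question, reused as a function name.

```python
import os
import tempfile


def linepos(path, lineno):
    """Return the 0-based byte offset at which 1-based line `lineno` starts."""
    position = 0
    with open(path, 'rb') as f:          # bytes, so len(line) counts bytes
        for current, line in enumerate(f, start=1):
            if current == lineno:
                return position
            position += len(line)        # includes the line ending
    raise ValueError('file has fewer than {} lines'.format(lineno))


# Demonstration: the first line contains a 2-byte UTF-8 character,
# so it is 4 characters long but occupies 5 bytes.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write('\u00e9,1\nb,2\nc,3\n'.encode('utf-8'))
try:
    print(linepos(tmp.name, 1))  # 0
    print(linepos(tmp.name, 2))  # 5
    print(linepos(tmp.name, 3))  # 9
finally:
    os.remove(tmp.name)
```

This still reads every line before the target, so it is linear in the size of the skipped part of the file.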

Answer 3

Well, if your pattern is simple, then the solution is simple too:

$ echo -e '#!/bin/bash\necho abracadabra' >/tmp/script
$ pattern=bash
$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
    | wc -c 
8

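As you can see, this outputs the position of the first character of the pattern, assuming the first character in the file is numbered 1.

NB 1: sed habitually appends a trailing newline to the last string it prints. So although the part of the line before the pattern is 7 bytes long (count them → #!/bin/), what wc -c actually sees is: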

$ sed -rn "0,/$pattern/ {s/^(.*)$pattern.*$/\1/p ;t exit; p; :exit }" /tmp/script \
   | hexdump -C
00000000  23 21 2f 62 69 6e 2f 0a                           |#!/bin/.|
00000008

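This could be a source of trouble if you are looking for EOF. I can't think of a more fitting case; I just wanted to point it out.

NB 2: If the pattern contains special characters, sed will fail. If you can provide an example of what you want to match, I can escape it.

NB 3: This assumes the pattern is unique. It will not work if you stopped reading the file at the second or third occurrence of the pattern.

Update: I found a simpler way.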

$ grep -bo bash <<< '#!/bin/bash'
7:bash

With GNU grep, two options are involved (-b and -o):

-b, --byte-offset
    Print the 0-based byte offset within the input file before  each  line  of
    output. If -o (--only-matching)  is specified, print the offset of the
    matching part itself.

I recommend grep because, if you pass the -F option, it treats the pattern as a plain string.

$ grep -F '!@##$@#%%^%&*%^&*(^)((**%%^@#' <<<'!@##$@#%%^%&*%^&*(^)((**%%^@#' 
!@##$@#%%^%&*%^&*(^)((**%%^@#
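
For comparison with the parser side of the question, here is a minimal Python 3 sketch that reports the same kind of 0-based byte offsets as grep -F -bo, for a literal byte string. It is an illustration only: `byte_offsets` is a hypothetical name, the file is assumed to be non-empty (mmap cannot map an empty file), and matches are non-overlapping.

```python
import mmap
import os
import tempfile


def byte_offsets(path, pattern):
    """Return 0-based byte offsets of each literal occurrence of `pattern` (bytes)."""
    if not pattern:
        raise ValueError('pattern must be a non-empty bytes object')
    offsets = []
    with open(path, 'rb') as f:
        # Map the file read-only instead of loading it into memory.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = mm.find(pattern)
            while start != -1:
                offsets.append(start)
                # Continue after this match (non-overlapping search).
                start = mm.find(pattern, start + len(pattern))
    return offsets


# Demonstration on the same two-line script used in the sed example.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'#!/bin/bash\necho abracadabra\n')
try:
    print(byte_offsets(tmp.name, b'bash'))  # [7]
    print(byte_offsets(tmp.name, b'abra'))  # [17, 24]
finally:
    os.remove(tmp.name)
```

Because every occurrence is returned, a pattern that is not unique can still be handled by picking the second or third offset from the list. Note that these offsets are 0-based, whereas the sed/wc pipeline above counts from 1.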
