Estimating the number of lines in a file - file size and size of all lines do not match

Time: 2014-12-24 22:24:41

Tags: python file io filesize disk

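I have a few hundred files, each somewhere between megabytes and a few gigabytes in size, and I would like to estimate the number of lines in each (i.e. an exact count is not needed). Every line is very regular, for example 4 long ints and 5 double floats.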

I tried finding the average size of the first AVE_OVER lines in the file, and then using that to estimate the total number of lines:

import os
import sys

# Exact count, for reference (reads the whole file)
nums = sum(1 for line in open(files[0]))
print "Number of lines = ", nums

AVE_OVER = 10
lineSize = 0.0
count = 0
for line in open(files[0]):
    lineSize += sys.getsizeof(line)
    count += 1
    if( count >= AVE_OVER ): break

lineSize /= count
fileSize = os.path.getsize(files[0])
numLines = fileSize/lineSize
print "Estimated number of lines = ", numLines

The estimate is far off:

> Number of lines =  505235
> Estimated number of lines =  324604.165863

So I tried computing the total size of all the lines in the file, to compare it with the file size reported by the system:

fileSize = os.path.getsize(files[0])
totalLineSize = 0.0
for line in open(files[0]):
    totalLineSize += sys.getsizeof(line)

print "File size = %.3e" % (fileSize)
print "Total Line Size = %.3e" % (totalLineSize)

But these are inconsistent as well!

> File size = 3.366e+07
> Total Line Size = 5.236e+07

Why is the sum of the sizes of the individual lines so much larger than the actual total file size? How can I correct for this?


Edit: the algorithm I ended up with (version 2.0); thanks to @J.F.Sebastian

def estimateLines(files):
    """ Estimate the number of lines in the given file(s) """

    if( not np.iterable(files) ): files = [files]
    LEARN_SIZE = 8192

    # Get total size of all files
    numLines = sum( os.path.getsize(fil) for fil in files )

    with open(files[0], 'rb') as file:
        buf = file.read(LEARN_SIZE)
        numLines /= (len(buf) // buf.count(b'\n'))

    return numLines

2 answers:

Answer 0 (score: 3)

To estimate the number of lines in a file:

def line_size_hint(filename, learn_size=1<<13):
    with open(filename, 'rb') as file:
        buf = file.read(learn_size)
        return len(buf) // buf.count(b'\n')

number_of_lines_approx = os.path.getsize(filename) // line_size_hint(filename)
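
The estimate above divides by the number of newlines found in the sample, so it raises ZeroDivisionError when the first learn_size bytes contain no newline, and the integer line-size hint rounds down before the final division. A minimal sketch of a guarded variant, written for Python 3. The name estimate_line_count and the choice to report 1 for a non-empty sample without a newline are assumptions made for illustration, not part of the answer:

```python
import os

def estimate_line_count(filename, learn_size=1 << 13):
    """Estimate the number of lines from the first learn_size bytes (hypothetical helper)."""
    with open(filename, 'rb') as f:
        buf = f.read(learn_size)
    if not buf:
        return 0  # empty file
    newlines = buf.count(b'\n')
    if newlines == 0:
        # Assumption: no newline in the sample, so treat the file as a single long line.
        return 1
    # Scale the newline count of the sample by (file size / sample size).
    return os.path.getsize(filename) * newlines // len(buf)
```

Multiplying before dividing keeps everything in integer arithmetic while avoiding the early rounding of the average line size. The result is still only as good as the assumption that the first few kilobytes are representative of the whole file.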

To find the exact number of lines, you could use a wc-l.py script:

#!/usr/bin/env python
import sys
from functools import partial

print(sum(chunk.count('\n') for chunk in iter(partial(sys.stdin.read, 1 << 15), '')))
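
The script above reads from standard input. The same chunked counting can be sketched for a file opened by path, in binary mode so that no decoding takes place on Python 3. The function name count_lines and the default chunk size are assumptions chosen for this example:

```python
from functools import partial

def count_lines(filename, chunk_size=1 << 15):
    """Count newline bytes by reading the file in fixed-size binary chunks (hypothetical helper)."""
    with open(filename, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size) until it returns b''.
        return sum(chunk.count(b'\n')
                   for chunk in iter(partial(f.read, chunk_size), b''))
```

Like wc -l, this counts newline characters, so a final line without a trailing newline is not included.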

Answer 1 (score: 1)

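sys.getsizeof is the sole cause of the problem here. It reports the in-memory size of a Python object, which is arbitrary and implementation-dependent, and it should not be used at all except in very rare circumstances.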
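
The numbers in the question fit this explanation: (5.236e+07 - 3.366e+07) / 505235 is roughly 37 extra bytes per line, which looks like a fixed per-object overhead rather than anything stored in the file. A small sketch of the difference, written for Python 3; the sample line is made up for illustration, and the exact overhead printed depends on the interpreter and platform:

```python
import sys

# A made-up line resembling the data described in the question.
line = b"1 2 3 4 0.1 0.2 0.3 0.4 0.5\n"

on_disk = len(line)              # bytes this line occupies in the file
in_memory = sys.getsizeof(line)  # size of the Python object, header included
overhead = in_memory - on_disk   # implementation-dependent bookkeeping

print(on_disk, in_memory, overhead)
```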

Just open the file in binary mode and use len to get the actual length of each line.
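
A minimal sketch of that suggestion, written for Python 3. The helper name total_line_bytes is an assumption for this example. In binary mode each line keeps its original bytes, line terminator included, so the lengths should add up to the size reported by os.path.getsize:

```python
import os

def total_line_bytes(filename):
    """Sum the on-disk length of every line, reading in binary mode (hypothetical helper)."""
    with open(filename, 'rb') as f:
        return sum(len(line) for line in f)

def sizes_match(filename):
    """Check that the summed line lengths equal the file size."""
    return total_line_bytes(filename) == os.path.getsize(filename)
```

The same len(line) can replace sys.getsizeof(line) in the averaging loop from the question, as long as the file is opened with 'rb'.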