Question

I have a file that I need to split into N smaller files. The smallest chunk should be at least X bytes, and all the files should have (almost) the same size:

So, for example, with N = 4 and X = 3, the string 'abcdefghij' would return ['abcd', 'efg', 'hij'], because:

3 chunks < 4 chunks
4 chars > 3 chars

I wrote a split function, but it sometimes creates one extra string, so I should probably pass in the x value instead of calculating it.

def split(string, n):
    x = len(string)//n
    return [string[i:i+x] for i in range(0, len(string), x)]
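
A quick check of where the extra string comes from, as a sketch that repeats the function above so it runs on its own: whenever len(string) is not a multiple of n, the floor division leaves a remainder, and that remainder ends up in an additional, shorter chunk.

```python
def split(string, n):
    # Same logic as above: a fixed chunk width obtained by floor division.
    x = len(string) // n
    return [string[i:i + x] for i in range(0, len(string), x)]

# 10 characters into 3 chunks: x = 3, so the 10th character spills over.
parts = split('abcdefghij', 3)
print(parts)       # ['abc', 'def', 'ghi', 'j']
print(len(parts))  # 4 chunks instead of the requested 3
```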

The real problem is how to calculate the number of chunks to cut the file into, given a minimum number of bytes per chunk.

def calculate(length, max_n, min_x):
    n, x = ...
    return n, x

Is there a simple, known algorithm for this kind of thing?

In fact, the file sizes do not have to differ by at most 1 byte, because I want to maximize the number of chunks.

Answer 1

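What do you mean by "cut the file with a minimum number of bytes"? Either you have not fully explained the problem, or there is no unique solution.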

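As your solution shows, this is a division problem: if L is the total length, you can divide it into n chunks for any n ≤ L. The remainder (which is necessarily less than n) gives the number of chunks that are one longer than the others. For example, 10 % 3 = 1, so in your example one of the three chunks is longer. But you could also divide 10 by 7 (remainder 3) to get seven chunks, three of which are longer (length 2 instead of 1). Or simply use 10 chunks of length 1, if you really want to "maximize the number of chunks", as you wrote.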
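
The division argument can be sketched with divmod. The helper name chunk_lengths is made up for illustration; it only reports the chunk lengths and assumes 1 <= n <= L.

```python
def chunk_lengths(L, n):
    """Return the lengths of n near-equal chunks covering L items.

    Hypothetical helper for illustration; assumes 1 <= n <= L.
    """
    size, remainder = divmod(L, n)
    # The first `remainder` chunks are one longer than the rest.
    return [size + 1] * remainder + [size] * (n - remainder)

print(chunk_lengths(10, 3))   # [4, 3, 3]
print(chunk_lengths(10, 7))   # [2, 2, 2, 1, 1, 1, 1]
print(chunk_lengths(10, 10))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```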

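More generally: for any minimum length m you specify, choose N = L // m. Your chunks then all have length at least m and differ by at most one: m and m+1 when the remainder L % m is smaller than N (or just m when L % m is 0). As I said, it is just a division problem.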
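
A sketch of this rule for a minimum length m. The helper name lengths_for_minimum is hypothetical, and the code assumes 1 <= m <= L.

```python
def lengths_for_minimum(L, m):
    """Chunk lengths when using as many chunks as a minimum length m allows.

    Hypothetical helper for illustration; assumes 1 <= m <= L.
    """
    n = L // m                      # the most chunks that can each hold m items
    size, remainder = divmod(L, n)  # size >= m by construction
    return [size + 1] * remainder + [size] * (n - remainder)

print(lengths_for_minimum(10, 3))  # [4, 3, 3]
print(lengths_for_minimum(12, 3))  # [3, 3, 3, 3]
print(lengths_for_minimum(26, 7))  # [9, 9, 8]
```

In the last call the chunks come out as 9, 9 and 8 items, so every chunk still holds at least m = 7 items.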

Answer 2

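Not sure about simple or known, but this seems to do the trick. It returns N strings and assigns the extra characters to the strings at the front of the set.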

import itertools as it
s = 'abcdefhijklm'
def hunks(s, n):
    size, extra = divmod(len(s), n)
    i = 0
    extras = it.chain(it.repeat(1, extra), it.repeat(0))
    while i < len(s):
        e = next(extras)
        yield s[i:i + size + e]
        i += size + e
list(hunks(s, 4))
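
For comparison, a sketch of the same distribution using index arithmetic instead of an iterator of extras. The name hunks_by_index is hypothetical, and the code assumes 1 <= n <= len(s).

```python
def hunks_by_index(s, n):
    """Yield n slices of s; the first len(s) % n slices get one extra item.

    Hypothetical variant for illustration; assumes 1 <= n <= len(s).
    """
    size, extra = divmod(len(s), n)
    start = 0
    for i in range(n):
        # Pieces with an index below `extra` are one item longer.
        stop = start + size + (1 if i < extra else 0)
        yield s[start:stop]
        start = stop

print(list(hunks_by_index('abcdefghij', 4)))  # ['abc', 'def', 'gh', 'ij']
```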

Answer 3

def calculate(L, N, X):
    n = min(L//X, N)
    return n, L//n
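
A usage sketch for calculate, repeating the definition so the snippet runs on its own. One assumption to keep in mind is X <= L: otherwise L//X is 0 and the second division raises ZeroDivisionError.

```python
def calculate(L, N, X):
    # At most N chunks, and no more chunks than the minimum size X allows.
    n = min(L // X, N)
    return n, L // n

print(calculate(10, 4, 3))   # (3, 3): 3 chunks of at least 3 items each
print(calculate(26, 4, 7))   # (3, 8)
print(calculate(100, 4, 3))  # (4, 25): here the cap N is the limiting factor
```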

Edit:

def spread(seq, N=None, X=1):
    """Yield successive subsequences of seq having at least X elements.

    If N is specified, the number of subsequences yielded will not exceed N.

    The first L % n subsequences yielded (where L = len(seq) and n is the
    number of subsequences yielded) will be longer by 1 than the remaining
    ones.

    >>> list(spread('abcdefghij', 4, 3))
    ['abcd', 'efg', 'hij']
    >>> list(spread('abcdefghijklmnopqrstuvwxyz', 4, 7))
    ['abcdefghi', 'jklmnopqr', 'stuvwxyz']

    seq    any object supporting len(...) and slice-indexing
    N      a positive integer (default: L)
    X      a positive integer not greater than L (default: 1)
    """

    # All error-checking code omitted

    L = len(seq)       # length of seq
    assert 0 < X <= L

    if N is None: N = L
    assert 0 < N

    # A total of n subsequences will be yielded, the first r of which will 
    # have length x + 1, and the remaining ones will have length x.

    # if we insist on using calculate()...
    # n, x = calculate(L, N, X)
    # r = L % n

    # ...but this entails separate computations of L//n and L%n; may as well
    # do both with a single divmod(L, n)
    n = min(L//X, N)
    x, r = divmod(L, n)

    start = 0
    stride = x + 1    # stride will revert to x when i == r
    for i in range(n):
        if i == r: stride = x
        finish = start + stride
        yield seq[start:finish]
        start = finish
    assert start == L

Split a file into no more than N chunks, each with a minimum length
