Proof of Mark Ransom's answer

Question

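I have a large 13 GB text file with 158,609,739 lines, and I want to randomly select 155,000,000 of them.

I tried to shuffle the file and then keep the first 155,000,000 lines, but my RAM (16 GB) does not seem to be enough for that. The pipelines I tried were: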

shuf file | head -n 155000000
sort -R file | head -n 155000000

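Now, instead of selecting lines, I think it would be more memory-efficient to delete 3,609,739 random lines from the file, which leaves a final file of 155,000,000 lines.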

Answer 1

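As you copy each line of the file to the output, assess the probability that it should be deleted. The first line should have a 3,609,739/158,609,739 chance of being deleted. If you generate a random number between 0 and 1 and that number is less than that ratio, don't copy the line to the output. If the first line was deleted, the odds for the second line are now 3,609,738/158,609,738; if the second line is not deleted, the odds for the third line are 3,609,738/158,609,737. Repeat until done.

Because the odds change with each line processed, this algorithm guarantees the exact line count. Once you've deleted 3,609,739 lines the odds drop to zero; if at any point you need to delete every remaining line in the file, the odds go to one.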
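The procedure described above can be sketched in Python as a generator. This is only an illustration under assumptions: the helper name `drop_random_lines` and its parameters are hypothetical, and the total line count has to be known in advance.

```python
import random

def drop_random_lines(lines, total, to_delete, rng=random):
    """Yield the surviving lines; exactly `to_delete` of `total` are dropped.

    `total` must equal the real number of lines (an assumption of this sketch).
    """
    remaining = total
    for line in lines:
        # Chance of deleting this line: deletions still owed / lines still unread.
        if rng.random() < to_delete / remaining:
            to_delete -= 1
        else:
            yield line
        remaining -= 1

kept = list(drop_random_lines(range(10), total=10, to_delete=3))
print(len(kept))  # 7
```

Since the generator holds only one line at a time, memory use does not depend on the file size. The surviving lines keep their original order.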

Answer 2

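You could always pre-generate the line numbers you plan to delete (a list of 3,609,739 random numbers selected without replacement), then just iterate through the file and copy it to another one, skipping lines as necessary. As long as you have space for a new file, this will work.

You could select the random numbers with random.sample, e.g.,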

random.sample(range(158609739), 3609739)  # xrange on Python 2
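A minimal sketch of the copy step, assuming the sampled line numbers fit in memory. The function name `copy_skipping` is hypothetical, and converting the sample to a set is an addition here so that each membership test takes constant time instead of scanning a list.

```python
import random

def copy_skipping(lines, total, to_delete, rng=random):
    """Yield the lines whose 0-based index was not picked for deletion."""
    # Sample without replacement, then store in a set for O(1) lookups.
    skipping = set(rng.sample(range(total), to_delete))
    for i, line in enumerate(lines):
        if i not in skipping:
            yield line

sample_lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
kept = list(copy_skipping(sample_lines, total=5, to_delete=2))
print(len(kept))  # 3
```

For the real file, `lines` would be the open input file and the yielded lines would be written to the output file. A set of about 3.6 million integers probably needs a few hundred megabytes in CPython, which should still fit comfortably in 16 GB.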

Answer 3

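Proof of Mark Ransom's answer

Let's use numbers that are easier to think about (at least for me!):

10 items
delete 3 of them

On the first pass through the loop we will assume that the first three items get deleted. Here is what the probabilities look like:

1st item: 3/10 = 30%
2nd item: 2/9 = 22%
3rd item: 1/8 = 12.5%
4th item: 0/7 = 0%
5th item: 0/6 = 0%
6th item: 0/5 = 0%
7th item: 0/4 = 0%
8th item: 0/3 = 0%
9th item: 0/2 = 0%
10th item: 0/1 = 0%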

As you can see, once the probability hits zero, it stays at zero. But what if nothing is getting deleted?

1st item: 3/10 = 30%
2nd item: 3/9 = 33%
3rd item: 3/8 = 37.5%
4th item: 3/7 = 43%
5th item: 3/6 = 50%
6th item: 3/5 = 60%
7th item: 3/4 = 75%
8th item: 3/3 = 100%
9th item: 2/2 = 100%
10th item: 1/1 = 100%
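Both tables can be reproduced with a few lines of Python. The helper `odds_sequence` is hypothetical: it takes a list of booleans saying whether each item ends up deleted and returns the fraction (numerator, denominator) that applies at each step.

```python
def odds_sequence(total, to_remove, deleted_flags):
    """Return the (to_remove, remaining) pair seen before each item is decided."""
    odds = []
    remaining = total
    for deleted in deleted_flags:
        odds.append((to_remove, remaining))
        if deleted:
            to_remove -= 1
        remaining -= 1
    return odds

# Case 1: the first three items are deleted.
print(odds_sequence(10, 3, [True] * 3 + [False] * 7))
# Case 2: nothing is deleted until the odds force it on the last three items.
print(odds_sequence(10, 3, [False] * 7 + [True] * 3))
```

The first call lists 3/10, 2/9, 1/8 and then zeros; the second climbs from 3/10 to 3/3 and finishes with 2/2 and 1/1.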

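So even though the probability varies from line to line, overall you get the result you are looking for. I went a step further and coded a test in Python that runs one million iterations, as a final proof to myself. It removes seven items from a list of 100: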

# python 3.2
from __future__ import division
from stats import mean  # http://pypi.python.org/pypi/stats
import random

counts = dict()
for i in range(100):
    counts[i] = 0

removed_failed = 0

for _ in range(1000000):
    to_remove = 7
    from_list = list(range(100))
    removed = 0
    while from_list:
        current = from_list.pop()
        probability = to_remove / (len(from_list) + 1)
        if random.random() < probability:
            removed += 1
            to_remove -= 1
            counts[current] += 1
    if removed != 7:
        removed_failed += 1

print(counts[0], counts[1], counts[2], '...',
      counts[49], counts[50], counts[51], '...',
      counts[97], counts[98], counts[99])
print("remove failed: ", removed_failed)
print("min: ", min(counts.values()))
print("max: ", max(counts.values()))
print("mean: ", mean(counts.values()))

Here is the result from one of the several times I ran it (they were all similar):

70125 69667 70081 ... 70038 70085 70121 ... 70047 70040 70170
remove failed:  0
min:  69332
max:  70599
mean:  70000.0

A final note: Python's random.random() returns a value in [0.0, 1.0) (1.0 is not included).

Answer 4

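I believe you're looking for "Algorithm S" from section 3.4.2 of Knuth (D. E. Knuth, The Art of Computer Programming. Volume 2: Seminumerical Algorithms, second edition. Addison-Wesley, 1981).

You can see several implementations at http://rosettacode.org/wiki/Knuth%27s_algorithm_S

The Perlmonks list has some Perl implementations of Algorithm S and Algorithm R that might also prove useful.

These algorithms rely on a meaningful interpretation of floating-point numbers such as 3609739/158609739, 3609738/158609738, etc., which might not have sufficient resolution with a standard Float datatype, unless the Float datatype is implemented with double precision or larger.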
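One way to sidestep the precision concern is to compare integers instead of floats. The sketch below is a variation offered as an assumption, not Algorithm S as Knuth states it, and the name `drop_lines_integer` is hypothetical. `random.randrange(remaining)` returns a uniform integer from 0 to remaining - 1, so it falls below `to_delete` with probability exactly to_delete/remaining.

```python
import random

def drop_lines_integer(lines, total, to_delete, rng=random):
    """Selection sampling with an integer comparison instead of a float ratio."""
    remaining = total
    for line in lines:
        # randrange(remaining) is uniform over 0..remaining-1, so this test
        # succeeds with probability exactly to_delete / remaining.
        if rng.randrange(remaining) < to_delete:
            to_delete -= 1
        else:
            yield line
        remaining -= 1

kept = list(drop_lines_integer(range(100), total=100, to_delete=7))
print(len(kept))  # 93
```

Python's own float is already double precision, so this matters mostly when porting the idea to a language whose default float is single precision.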

Answer 5

Here's a possible solution using Python:

import random

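# Placeholder paths; substitute your own files.
input_path = 'input.txt'
output_path = 'output.txt'

# A set keeps the "i in skipping" test fast; a list would be scanned on every line.
skipping = set(random.sample(range(158609739), 3609739))

with open(input_path) as infile, open(output_path, 'w') as outfile:
    for i, line in enumerate(infile):
        if i in skipping:
            continue
        outfile.write(line)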

Here's another one using Mark's method:

import random

lines_in_file = 158609739
lines_left_in_file = lines_in_file
lines_to_delete = lines_in_file - 155000000

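# Placeholder paths; substitute your own files.
input_path = 'input.txt'
output_path = 'output.txt'

with open(input_path) as infile, open(output_path, 'w') as outfile:
    try:
        for line in infile:
            # Chance of dropping this line: deletions still owed / lines still unread.
            # True division is assumed (Python 3).
            current_probability = lines_to_delete / lines_left_in_file
            lines_left_in_file -= 1
            if random.random() < current_probability:
                lines_to_delete -= 1
                continue
            outfile.write(line)
    except ZeroDivisionError:
        # Raised when the file holds more lines than lines_in_file claims.
        print("More than %d lines in the file" % lines_in_file)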

Answer 6

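I wrote this code before seeing that Darren Yin had expressed its principle.

I've modified my code to use the name skipping (I didn't dare to choose kangaroo ...) and the keyword continue from Ethan Furman, whose code follows the same principle too.

I defined default arguments for the parameters of the function so that it can be called several times without the names being looked up again at each call.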

import random
import os.path

def spurt(ff, skipping):
    # yield every line of ff whose index is not in skipping
    for i, line in enumerate(ff):
        if i in skipping:
            print('line %d excluded : %r' % (i, line))
            continue
        yield line

def randomly_reduce_file(filepath, nk=None,
                         d={0: 'st', 1: 'nd', 2: 'rd', 3: 'th'}, spurt=spurt,
                         sample=random.sample, splitext=os.path.splitext):

    # count of the lines of the original file
    # (binary mode, so the count matches the binary read below)
    with open(filepath, 'rb') as f:
        nl = sum(1 for _ in f)

    # asking for the number of lines to keep, if not given as argument
    if nk is None:
        nk = int(input('  The file has %d lines.'
                       '  How many of them do you '
                       'want to randomly keep ? : ' % nl))

    # transfer of the lines to keep,
    # from one file to another file with different name
    if nk < nl:
        with open(filepath, 'rb') as f, \
             open('COPY'.join(splitext(filepath)), 'wb') as g:
            # sample(range(nl), nl - nk) is the list of the line numbers
            # to be excluded; set() keeps the lookup in spurt() fast
            g.writelines(spurt(f, set(sample(range(nl), nl - nk))))
    else:
        print('   %d is %s than the number of lines (%d) in the file\n'
              '   no operation has been performed'
              % (nk, 'the same' if nk == nl else 'greater', nl))

Answer 7

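Using the $RANDOM variable, you can get a random number between 0 and 32,767.

With this, you could read in each line and check whether $RANDOM is less than 155,000,000 / 158,609,739 * 32,767 (which is 32,021); if so, let the line through.

Of course, this won't give you exactly 155,000,000 lines, but it will come very close, depending on how uniform the random number generator is.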
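To get a feel for how close "very close" is, the independent per-line test can be modelled as a binomial draw. The sketch below only does the arithmetic; it assumes $RANDOM is uniform over the 32,768 values 0..32767, which is an idealization, and the variable names are illustrative.

```python
import math

total_lines = 158609739
threshold = 32021
# $RANDOM is assumed to take 32768 equally likely values, so this is P(keep).
p_keep = threshold / 32768

# Mean and standard deviation of a binomial(total_lines, p_keep) count.
expected_kept = total_lines * p_keep
std_dev = math.sqrt(total_lines * p_keep * (1 - p_keep))

print(round(expected_kept))
print(round(std_dev))
```

Under these assumptions the expected count lands a few thousand lines below 155,000,000, with a run-to-run spread of a similar size, so an exact count needs one of the counting methods from the other answers.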

EDIT: Here is some code to get you started:

#!/bin/bash
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line; do
  if (( $RANDOM < 32021 ))
  then
    printf '%s\n' "$line"
  fi
done

Call it like this:

thatScript.sh <inFile.txt >outFile.txt

How do I randomly delete some lines from a large file?

7 answers:
