Question

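Given a huge file containing text (one sentence per line), the task is to extract N tokens (for example, 100 million tokens out of 3 billion). Since I cannot split a sentence into parts, I need to find the closest number of lines that contains the given number of tokens.

I tried the following code: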

perl -p -e 's/\n/ #/g' huge_file | cut -d' ' -f1-100000000 | grep -o ' #' | wc -w

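This replaces each newline with the symbol ' #' (basically joining all sentences into a single line) and counts the number of ' #' symbols, which should correspond to the number of sentences (huge_file does not contain the '#' symbol). However, grep cannot handle such a long line and fails with a 'grep: memory exhausted' error. Is there another efficient way to accomplish the task that also works for very large files?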

Answer 1

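I'm having a little difficulty understanding what you're asking, but I think you're going about it quite badly. Running perl as a super-sed, then cut, then grep, then wc is horribly inefficient.

If I understand you correctly, you want as many lines as it takes to get at least 100M words.

Why not do this instead: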

#!/usr/bin/env perl

use strict;
use warnings;

my $wordcount = 0;

#use 'magic' filehandle - read piped input or
#command line specified 'myscript.pl somefilename' - just like sed/grep
while ( <> ) {
    #split on whitespace, count number of fields. Or words in this case.
    $wordcount += scalar split;
    #chomp; if you don't want the line feed here
    #print current line
    print;
    #bail out if our wordcount has reached a certain number.
    last if $wordcount >= 100_000_000;
    #NB $. is line number if you wanted to just do a certain number of lines.
}

#already printed the content with line feeds intact. 
#this prints the precise count we've printed. 
print $wordcount," words printed\n";

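This iterates over your file and bails out as soon as it has seen 100M words. That means you no longer have to read the whole file, and you don't need to invoke a daisy chain of commands.

If you really insist, this reduces to a one-liner. Note that under -p, last exits the loop before the implicit print, so the line that crosses the threshold is not printed: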

perl -p -e '$wordcount += scalar split; last if $wordcount > 100_000_000;'

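Again, I can't really tell what the significance of the line feeds and the # sign is, so I haven't done anything with them. But s/\n/ #/ works fine in the code block above, as does chomp; to remove trailing line feeds, if that's what you're after.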
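
If what you want is the line count itself (the "closest number of lines" from the question) rather than the text, the same streaming idea can be sketched with awk. This is only a rough sketch under some assumptions: tokens are whitespace-separated fields, the function name closest_line is made up for illustration, and ties go to the larger line count.

```shell
#!/bin/sh
# closest_line FILE TARGET (hypothetical helper, not from the original answer):
# print the number of leading lines whose cumulative whitespace-separated
# token count is closest to TARGET.
closest_line() {
    awk -v target="$2" '
        {
            prev = total      # tokens seen before this line
            total += NF       # NF = number of whitespace-separated fields
            if (total >= target) {
                # pick whichever side of the boundary is nearer to target;
                # this can print 0 if the first line alone overshoots badly
                if (target - prev < total - target)
                    print NR - 1
                else
                    print NR
                found = 1
                exit          # stop reading; the END rule still runs
            }
        }
        # file holds fewer tokens than target: every line is needed
        END { if (!found) print NR }
    ' "$1"
}
```

Something like n=$(closest_line huge_file 100000000) followed by head -n "$n" huge_file would then extract the lines. It reads the file only up to the boundary and never builds one giant line, which is what made grep run out of memory.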
