Question

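I have a file with 25 million lines, and I want to extract a specific 10 million of those lines.

I have the indices of those lines in another file. How can I do this efficiently?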

Answer 1

Assuming the list of line numbers is in the file list-of-lines, the data is in data-file, and the numbers in list-of-lines are in ascending order, you can write:

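current=0
while read -r wanted
do
     # Advance through data-file (fd 3) until the wanted line number is reached
     while ((current < wanted))
     do
         # IFS= and -r keep leading whitespace and backslashes intact
         if IFS= read -r -u 3 line
         then ((current++))
         else break 2    # data-file ended early: leave both loops
         fi
     done
     printf '%s\n' "$line"
done < list-of-lines 3< data-file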

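This uses a Bash extension that lets you specify which file descriptor read should read from (read -u 3 reads from file descriptor 3). The list of line numbers to print is read from standard input; the data file is read from file descriptor 3. This makes a single pass through each of the two files, which is within a constant factor of optimal.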

If list-of-lines is not sorted, replace the last line with the following, which uses a Bash extension called process substitution:

done < <(sort -n list-of-lines) 3< data-file
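
To see the loop in action, here is a minimal sketch on a five-line sample. The temporary directory, the sample contents, and the result variable are made up for illustration; only the names list-of-lines and data-file come from the answer. The counter is incremented with an arithmetic assignment instead of ((current++)) so the sketch also behaves under set -e; that is a choice made here, not part of the original.

```shell
#!/bin/bash
# Illustrative only: sample files live in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta epsilon > "$workdir/data-file"
printf '%s\n' 2 4 5 > "$workdir/list-of-lines"

current=0
result=$(
    while read -r wanted
    do
        # Read data-file through fd 3 until line number "wanted" is reached
        while ((current < wanted))
        do
            if IFS= read -r -u 3 line
            then current=$((current + 1))
            else break 2
            fi
        done
        printf '%s\n' "$line"
    done < "$workdir/list-of-lines" 3< "$workdir/data-file"
)
printf '%s\n' "$result"
rm -rf "$workdir"
```

With the sample above, lines 2, 4 and 5 are selected, so the script should print beta, delta and epsilon.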

Answer 2

Assume the file containing the line indices is called "no.txt" and the data file is "input.txt".

awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt

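output.txt will contain the wanted lines. I am not sure whether this is efficient enough. In my environment it appears to be faster than this script (https://stackoverflow.com/a/22926494/3264368).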

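Some explanation:

The first command preprocesses the index file so that the numbers are right-aligned to width 8 with leading zeros (input.txt is known to have 25M lines, so 8 digits are enough).
The second command prints each line of input.txt with its line number in exactly the same format as the preprocessed index file, then joins the two to pick out the wanted lines (cut removes the line numbers). Note that join expects both inputs to be sorted on the join field, so no.txt should list the line numbers in ascending order.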
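
The two steps can be traced on a tiny sample. This is only a sketch: the temporary directory, the sample contents, and the variables padded, joined and result are invented for illustration, and cut -f2- is used as a portable stand-in for the GNU-only -f1 --complement.

```shell
#!/bin/bash
# Illustrative only: sample files live in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta epsilon > "$workdir/input.txt"
printf '%s\n' 2 4 > "$workdir/no.txt"

# Step 1: zero-pad the wanted line numbers to width 8
awk '{printf "%08d\n", $1}' "$workdir/no.txt" > "$workdir/no.1.txt"
padded=$(cat "$workdir/no.1.txt")

# Step 2a: number the input lines in the same format and join on the number
joined=$(nl -n rz -w 8 "$workdir/input.txt" | join - "$workdir/no.1.txt")

# Step 2b: drop the line-number column (equivalent to -f1 --complement)
result=$(printf '%s\n' "$joined" | cut -d " " -f2-)

printf '%s\n' "$padded" "$joined" "$result"
rm -rf "$workdir"
```

Two caveats worth checking against your own data: nl leaves empty lines unnumbered by default, and join rewrites the separators between whitespace-delimited fields as single spaces.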

Answer 3

Since you say the file of line numbers you are looking for is sorted, you can walk through both files in awk:

awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt

This reads each line of each file exactly once.
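
A possible variation, sketched here under a few assumptions: the index file name is passed in through an awk variable (idx, a name chosen for this example) rather than hard-coded, and awk exits as soon as the index file is exhausted, so the rest of the big file is not read. The sample files and the result variable are made up for illustration.

```shell
#!/bin/bash
# Illustrative only: sample files live in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta epsilon > "$workdir/big_file.txt"
printf '%s\n' 1 3 > "$workdir/line_numbers.txt"

result=$(
    awk -v idx="$workdir/line_numbers.txt" '
        # Load the first wanted line number; stop if the index file is empty
        BEGIN { if ((getline nl < idx) <= 0) exit }
        # On a match, print the line and fetch the next wanted number;
        # exit once there are no more numbers to look for
        NR == nl { print; if ((getline nl < idx) <= 0) exit }
    ' "$workdir/big_file.txt"
)
printf '%s\n' "$result"
rm -rf "$workdir"
```

With indices 1 and 3, this should print alpha and gamma and stop reading after line 3.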

Answer 4

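With your index file as index.txt and your data file as data.txt, you can do this with sed as follows:

#!/bin/bash
# For each index, print that line of data.txt and quit sed.
# Note: data.txt is rescanned from the top for every index.
while read -r line_no
do
    sed "${line_no}q;d" data.txt
done < index.txt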

Answer 5

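You can run a loop that reads through the 25-million-line file and, when the loop counter reaches the line number you want, writes that line out. For example:

import java.io.*;

public class PrintLine {
    // Assumed usage for this example: java PrintLine <file> <1-based line number>
    public static void main(String[] args) throws IOException {
        long indice = Long.parseLong(args[1]);
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            long count = 0;
            while ((line = br.readLine()) != null) {
                count++; // number of lines read so far
                if (count == indice) {
                    System.out.println(line); // or write to a file
                    break;
                }
            }
        }
    }
}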

Get specific lines from a file
