Question

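I am currently trying to grep a large list of IDs (~5000) against an even larger CSV file (3,000,000 lines).

I want all the CSV lines that contain an ID from the ID file.

My naive approach was: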

cat the_ids.txt | while read line
do
  cat huge.csv | grep $line >> output_file
done

But this takes forever!

Is there a more efficient approach to this problem?

Answer 1

Try

grep -f the_ids.txt huge.csv

Also, since your patterns appear to be fixed strings, supplying the -F option might speed up grep.

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched.  (-F is specified by
          POSIX.)
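
As a small illustration of combining the two options, here is a sketch on made-up sample data. The file contents and the temporary directory are assumptions for the demo and are not taken from the question.

```shell
# Create throwaway sample files (hypothetical data, for illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' 11 23 55 > "$tmpdir/the_ids.txt"
printf '%s\n' 'a,11,x' 'b,99,y' 'c,23,z' > "$tmpdir/huge.csv"

# -f reads every pattern from the_ids.txt, so huge.csv is scanned once
# instead of once per ID; -F treats each pattern as a literal string.
result=$(grep -F -f "$tmpdir/the_ids.txt" "$tmpdir/huge.csv")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

On this sample, only the rows containing 11 and 23 are printed. Note that -F still matches substrings, so an ID can also match inside a longer number.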

Answer 2

Use grep -f:

grep -f the_ids.txt huge.csv > output_file

From man grep:

   -f FILE, --file=FILE
          Obtain patterns from FILE, one per line.  The empty file
          contains zero patterns, and therefore matches nothing.  (-f is
          specified by POSIX.)

If you provide some sample input, we may be able to improve the grep condition even further.

Test

$ cat ids
11
23
55
$ cat huge.csv 
hello this is 11 but
nothing else here
and here 23
bye

$ grep -f ids huge.csv 
hello this is 11 but
and here 23

Answer 3

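grep -f filter.txt data.txt becomes unwieldy when filter.txt is larger than a couple of thousand lines, so it is not the best choice in that situation. Even when using grep -f, we need to keep a few things in mind: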

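Use the -x option if the entire line in the second file needs to match.
Use -F if the first file contains strings rather than patterns.
Use -w to prevent partial matches when the -x option is not used.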
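
To make the difference between these options concrete, here is a sketch on hypothetical sample data (the IDs and rows are invented for the demo):

```shell
# Throwaway sample files (hypothetical data, for illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' 11 23 > "$tmpdir/filter.txt"
printf '%s\n' 'row,11,a' 'row,110,b' 'row,23,c' '11' > "$tmpdir/data.txt"

# Plain -F matches substrings, so the ID 11 also hits the row containing 110.
loose=$(grep -F -f "$tmpdir/filter.txt" "$tmpdir/data.txt")

# -w accepts only whole-word matches, so 110 no longer counts as 11.
words=$(grep -F -w -f "$tmpdir/filter.txt" "$tmpdir/data.txt")

# -x accepts only lines that are exactly equal to one of the patterns.
exact=$(grep -F -x -f "$tmpdir/filter.txt" "$tmpdir/data.txt")

printf 'loose:\n%s\n\nwords:\n%s\n\nexact:\n%s\n' "$loose" "$words" "$exact"
rm -r "$tmpdir"
```

For a CSV of IDs, -w is usually the option that avoids false positives, because commas are not word characters and therefore delimit the ID on both sides.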

This post has a good discussion of this topic (grep -f on large files):

Fastest way to find lines of a file from another larger file in Bash

And this post talks about grep -vf:

grep -vf too slow with large files

In summary, the best way to handle grep -f on large files is:

Matching the entire line:

awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt

Matching a particular field in the second file (using the ',' delimiter and field 2 in this example):

awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt

And for grep -vf:

Matching the entire line:

awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt

Matching a particular field in the second file (using the ',' delimiter and field 2 in this example):

awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
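
The field-based awk commands above can be tried on a small example. This is a sketch on hypothetical data; the rows and IDs are invented, and field 2 is assumed to hold the ID:

```shell
# Throwaway sample files (hypothetical data, for illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' 11 23 > "$tmpdir/filter.txt"
printf '%s\n' 'a,11,x' 'b,110,y' 'c,23,z' 'd,7,11' > "$tmpdir/data.txt"

# While reading the first file (FNR==NR), store each ID as an array key.
# For the second file, print rows whose field 2 is one of the stored keys.
matching=$(awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' \
  "$tmpdir/filter.txt" "$tmpdir/data.txt")

# Same lookup, negated: print rows whose field 2 is not in the ID list.
not_matching=$(awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' \
  "$tmpdir/filter.txt" "$tmpdir/data.txt")

printf 'matching:\n%s\n\nnot matching:\n%s\n' "$matching" "$not_matching"
rm -r "$tmpdir"
```

Because the lookup compares the whole of field 2 against the stored keys, 110 does not match 11, and an 11 that appears in a different column is ignored.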

Answer 4

Using ugrep may substantially speed up the search when matching the strings in a large the_ids.txt file against huge.csv:

ugrep -F -f the_ids.txt huge.csv

This also works with GNU grep, but I expect ugrep to run several times faster.
