Question

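I have a file containing a bunch of strings, and another file containing a bunch of words. I want to print every line of the first file that contains one of the first 20 words in the second file. I have been trying to do this with sed, but would grep or awk be a better choice?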

Answer 1

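The question is about "words", and I thought quite a bit about what that means while trying to assume as little as possible about the format of file2: it might be another book, a single phrase, or a comma- or tab-separated list.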

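We probably want to match whole words, so that "home" in file2 does not match "homely" in file1.
Strings containing digits, dashes, plus signs and the like are not English words and should not be considered.
Hyphenated words and possessives should be preserved.
Because we are matching "words", case should be ignored (this is easy to reverse).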
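
As a quick illustration of the first and last assumptions, here is a minimal sketch using grep's -w (whole word) and -i (ignore case) options. The three sample lines and the scratch directory are made up for the demonstration and are not part of the question.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'A homely cottage' 'Welcome HOME' 'no match here' > "$tmp/file1"

# -w: match whole words only; -i: ignore case.
whole=$(grep -w -i 'home' "$tmp/file1")
# Without -w, "homely" matches as well.
partial=$(grep -i 'home' "$tmp/file1")

printf 'whole-word:\n%s\n\nsubstring:\n%s\n' "$whole" "$partial"
rm -rf "$tmp"
```

With -w only the "Welcome HOME" line is selected; without it the "homely" line is selected too.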

If we are allowed to place restrictions on the format of file2, skip to the simplified egrep / sed answers at the end.

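The following answer first processes file2 in a subshell: it handles punctuation and delimiters, identifies the first 20 valid words, and builds a regular expression from that word list. The script then applies the regular expression (the subshell's output) to filter file1.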

egrep -i $(tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/') file1

To explain further, take the following file2 as an example:

$ cat file2
1The quick brown fox
jumps over- Frank's (empty-headed) lazy dog.

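The tr statement in the subshell pipeline replaces the unwanted delimiters with newlines, leaving the candidate words in a newline-separated list: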

$ tr -c "[:alnum:]-'" '\n' < file2
1The
quick
brown
fox
jumps
over-
Frank's

empty-headed

lazy
dog

The awk statement in the subshell pipeline keeps only the valid words and prints at most 20 of them:

$ tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }"
quick
brown
fox
jumps
Frank's
empty-headed
lazy
dog

The last statement in the subshell pipeline formats the word list as a regular expression:

$ tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/'
\<quick\>|\<brown\>|\<fox\>|\<jumps\>|\<Frank's\>|\<empty-headed\>|\<lazy\>|\<dog\>
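
The sed program works by collecting every line in the hold space (1h; 1!H), deleting the pattern space on every line but the last ($!d), and then copying the hold space back (g) before rewriting the newlines. If that feels too dense, one possible alternative is to wrap each word in \< and \> first and then join the lines with paste. This is only a sketch with a hypothetical three-word list, not the original answer's method:

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' quick brown fox)

# Wrap each word in \< ... \>, then join all lines with '|'.
regex=$(printf '%s\n' "$words" | sed 's/.*/\\<&\\>/' | paste -s -d '|' -)
printf '%s\n' "$regex"
```

For this input the result is \<quick\>|\<brown\>|\<fox\>, the same shape as the expression above.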

If we use egrep with this expression to filter a well-known text:

$ egrep -i "\<quick\>|\<brown\>|\<fox\>|\<jumps\>|\<Frank's\>|\<empty-headed\>|\<lazy\>|\<dog\>" kjv.txt | head -n 5
Ge30:32 I will pass through all thy flock to day, removing from thence all the speckled and spotted cattle, and all the brown cattle among the sheep, and the spotted and speckled among the goats: and of such shall be my hire.
Ge30:33 So shall my righteousness answer for me in time to come, when it shall come for my hire before thy face: every one that is not speckled and spotted among the goats, and brown among the sheep, that shall be counted stolen with me.
Ge30:35 And he removed that day the he goats that were ringstraked and spotted, and all the she goats that were speckled and spotted, and every one that had some white in it, and all the brown among the sheep, and gave them into the hand of his sons.
Ge30:40 And Jacob did separate the lambs, and set the faces of the flocks toward the ringstraked, and all the brown in the flock of Laban; and he put his own flocks by themselves, and put them not unto Laban's cattle.
Exo11:7 But against any of the children of Israel shall not a dog move his tongue, against man or beast: that ye may know how that the LORD doth put a difference between the Egyptians and Israel.

Putting it all together:

egrep -i $(tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/') kjv.txt | head -n 5
Ge30:32 I will pass through all thy flock to day, removing from thence all the speckled and spotted cattle, and all the brown cattle among the sheep, and the spotted and speckled among the goats: and of such shall be my hire.
Ge30:33 So shall my righteousness answer for me in time to come, when it shall come for my hire before thy face: every one that is not speckled and spotted among the goats, and brown among the sheep, that shall be counted stolen with me.
Ge30:35 And he removed that day the he goats that were ringstraked and spotted, and all the she goats that were speckled and spotted, and every one that had some white in it, and all the brown among the sheep, and gave them into the hand of his sons.
Ge30:40 And Jacob did separate the lambs, and set the faces of the flocks toward the ringstraked, and all the brown in the flock of Laban; and he put his own flocks by themselves, and put them not unto Laban's cattle.
Exo11:7 But against any of the children of Israel shall not a dog move his tongue, against man or beast: that ye may know how that the LORD doth put a difference between the Egyptians and Israel.
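
For anyone without kjv.txt at hand, the same pipeline can be tried on a few throwaway lines. The sketch below makes some choices of its own: the file1 contents are invented, the regular expression is stored in a variable first, the expansion is double-quoted as a precaution against word splitting and globbing, and grep -E is used in place of egrep (the two are equivalent). It assumes a grep that understands \< and \>, as GNU grep does.

```shell
tmp=$(mktemp -d)
# file2 is the example from above; file1 is hypothetical test data.
printf '%s\n' '1The quick brown fox' "jumps over- Frank's (empty-headed) lazy dog." > "$tmp/file2"
printf '%s\n' 'The Fox ran away' 'foxes are not matched' \
  'a doghouse is not matched either' 'Lazy afternoon' 'nothing here' > "$tmp/file1"

# Build the regular expression from the first 20 valid words of file2.
regex=$(
  tr -c "[:alnum:]-'" '\n' < "$tmp/file2" |
  awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" |
  sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/'
)

# Filter file1, ignoring case.
matches=$(grep -E -i "$regex" "$tmp/file1")
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Only the "The Fox ran away" and "Lazy afternoon" lines are printed; "foxes" and "doghouse" are rejected by the word boundaries.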

The solution runs fairly quickly on my year-old laptop:

$ wc -lw kjv.txt 
  31102  820736 kjv.txt
$ time egrep -i $(tr -c "[:alnum:]-'" '\n' < file2 | awk "/^[[:alpha:]]+(-[[:alpha:]]+)?('s|s')?$/ { print; i++ } i==20 { exit 0 }" | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/') kjv.txt > /dev/null

real    0m0.021s
user    0m0.016s
sys     0m0.000s

Simplified answers

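The above handles the complicated case where file2 is "noisy". What if file2 is defined to be a newline-separated list of words, so that we do not have to check for valid words? Then we can eliminate the first two stages of the previous subshell pipeline: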

egrep -i $(head -n20 file2 | sed '1h; 1!H; $!d; g; s/\n/ /g; s/^/\\</; s/ /\\>|\\</g; s/$/\\>/') file1
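
When file2 is already one word per line, it may not be necessary to build a regular expression at all. As a sketch (not part of the original answer), grep can read the word list itself with -f, treat each entry as a fixed string with -F, and apply whole-word and case-insensitive matching with -w and -i. Reading the list from standard input via "-f -" works with GNU grep; treat that as an assumption on other implementations. The sample files are hypothetical.

```shell
tmp=$(mktemp -d)
# Hypothetical inputs: a clean word list and some lines to filter.
printf '%s\n' home fox > "$tmp/file2"
printf '%s\n' 'A homely cottage' 'Welcome HOME' 'The fox ran' > "$tmp/file1"

# -F: fixed strings; -w: whole words; -i: ignore case;
# -f -: read the pattern list from standard input.
matches=$(head -n 20 "$tmp/file2" | grep -F -w -i -f - "$tmp/file1")
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Note that -w uses grep's own definition of a word (letters, digits and underscore), so entries containing apostrophes or hyphens may behave slightly differently from the \< ... \> version.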

Finally, what if the constraints are the same as before, but the word list in file2 is separated by single spaces?

egrep -i $(awk 'NF>20{NF=20}1' file2 | sed 's/^/\\</; s/ /\\>|\\</g; s/$/\\>/') file1
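
The awk part relies on the fact that assigning to NF truncates the record to that many fields, and the trailing 1 prints the (rebuilt) line. A small sketch with a made-up five-word line and a limit of 3 instead of 20 shows the effect; this behaves as described in gawk and mawk, and is assumed to behave the same in other awk implementations:

```shell
line='one two three four five'

# Keep only the first 3 fields of the line.
first3=$(printf '%s\n' "$line" | awk 'NF>3{NF=3}1')
printf '%s\n' "$first3"

# Feed the truncated line through the same sed step as above.
regex=$(printf '%s\n' "$first3" | sed 's/^/\\</; s/ /\\>|\\</g; s/$/\\>/')
printf '%s\n' "$regex"
```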

Answer 2

Solution:

sed 20q file2 > temp
grep -f temp file1
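
Two things are worth knowing about this shorter approach: sed 20q stops after the first 20 lines of file2, and plain grep -f treats each line as a pattern that may match anywhere in a line, including inside a longer word. The sketch below uses invented data (word1 through word25) and a scratch directory to show both effects, with -w added in the second run as one possible way to restrict matches to whole words:

```shell
tmp=$(mktemp -d)
# Hypothetical word list: word1 ... word25, one per line.
i=1
while [ "$i" -le 25 ]; do printf 'word%d\n' "$i"; i=$((i + 1)); done > "$tmp/file2"
printf '%s\n' 'has word3 in it' 'has word21 in it' 'has word20 in it' > "$tmp/file1"

# Keep only the first 20 lines of the word list.
sed 20q "$tmp/file2" > "$tmp/patterns"

# Plain -f: "word2" also matches inside "word21".
plain=$(grep -f "$tmp/patterns" "$tmp/file1")
# With -w: only whole-word matches, so word21 (not among the first 20) is dropped.
whole=$(grep -w -f "$tmp/patterns" "$tmp/file1")

printf 'plain:\n%s\n\nwhole-word:\n%s\n' "$plain" "$whole"
rm -rf "$tmp"
```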
