How to add `\macro{}` around the first occurrence of each match from a word list, using regex

Time: 2013-07-07 07:02:46

Tags: regex perl sed awk

I have a word list, list.txt, that looks like this:

fish
squirrel
bird
tree
mountain

I also have a file, text.txt, containing passages like this:

The fish ate the birds.
The squirrel lived in the tree on the mountain.
The fish did not like eating squirrels as they lived too high in the trees.

I need to mark the first occurrence in text.txt of every word from list.txt with TeX code such as \macro{}, so that the output looks like this:

The \macro{fish} ate the \macro{bird}s.
The \macro{squirrel} lived in the \macro{tree} on the \macro{mountain}.
The fish did not like eating squirrels as they lived too high in the trees.

How can I add \macro{} around the first occurrence of each word from the list, using Bash?

4 answers:

Answer 0 (score: 2)

GNU sed code:

$ sed -nr 's#(\w+)#s/\1/\1/;T\1;x;s/\1/\1/;x;t\1;x;s/.*/\& \1/;x;s/\1/\\\\macro\{\1\}/;:\1;$!N#p' list.txt|sed -rf - text.txt
$ cat list.txt
fish
squirrel
bird
tree
mountain

$ cat text.txt
The fish ate the birds.
The squirrel lived in the tree on the mountain.
The fish did not like eating squirrels as they lived too high in the trees.

$ sed -nr 's#(\w+)#s/\1/\1/;T\1;x;s/\1/\1/;x;t\1;x;s/.*/\& \1/;x;s/\1/\\\\macro\{\1\}/;:\1;$!N#p' list.txt|sed -rf - text.txt
The \macro{fish} ate the \macro{bird}s.
The \macro{squirrel} lived in the \macro{tree} on the \macro{mountain}.
The fish did not like eating squirrels as they lived too high in the trees.

Answer 1 (score: 1)

Nice and interesting question.

I can suggest the following awk for you:

awk 'NR==FNR{a[$1]=$1;next}
   {for (v in a) if (a[v] != "") {r=sub(v, "\\macro{" v "}"); if (r) a[v]=""}
   }
   1' list.txt text.txt

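Answer 2 (score: 1)

This preserves white space (unlike any solution that assigns to fields) and will not wrongly match the first 3 letters of "there" when looking for "the" (unlike any solution that does not wrap "word" in the word delimiters "\<...\>" or an equivalent):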

$ gawk 'NR==FNR{list[$0];next}
    {
        for (word in list)
            if ( sub("\\<"word"\\>","\\macro{&}") )
                delete list[word]
    }
1' list.txt text.txt
The \macro{fish} ate the birds.
The \macro{squirrel} lived in the \macro{tree} on the \macro{mountain}.
The fish did not like eating squirrels as they lived too high in the trees.

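The only caveat with this solution is that if "word" contains any RE metacharacters (e.g. *, +), sub() will evaluate them. Since you appear to be using English words, that should not happen, but if it can, let us know, because you will need a different solution.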
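To make the metacharacter caveat concrete, here is a minimal sketch. The list word `a.c` and the sample line are made up for illustration and are not from the question. Because sub() treats its first argument as a regular expression, the `.` matches any character, so the earlier `abc` is marked instead of the literal `a.c`:

```shell
# Hypothetical input: the literal word we want to mark is "a.c".
out=$(printf '%s\n' 'abc a.c' |
awk '{
    # "a.c" is used as a dynamic regex here, so "." matches the "b" in "abc".
    sub("a.c", "\\macro{&}")
    print
}')
printf '%s\n' "$out"
```

This prints `\macro{abc} a.c`, which is not what a literal match would give.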

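I see you have since posted that partial matches are actually desirable (e.g. "the" should match the start of "theory"), in which case you want this: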

$ awk 'NR==FNR{list[$0];next}
    {
        for (word in list)
            if ( sub(word,"\\macro{&}") )
                delete list[word]
    }
1' list.txt text.txt

That works as long as no RE metacharacters can appear in the words from list.txt. If they can, use this instead:

$ awk 'NR==FNR{list[$0];next}
    {
        for (word in list) {
            # index() does a plain string search, not an RE match
            start = index($0,word)
            if ( start > 0 ) {
                $0 = substr($0,1,start-1) \
                     "\\macro{" word "}"  \
                     substr($0,start+length(word))
                delete list[word]
            }
        }
    }
1' list.txt text.txt

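The last one is the most robust solution: it does a string comparison rather than an RE comparison, so it is unaffected by RE metacharacters, and it does not alter white space either (I know you said you don't care about that right now).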
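As a minimal sketch of the same index()-based idea, the string splice can be pulled out into a helper function. The function name `mark_first`, the sample word `c++` and the sample line are illustrative assumptions, not part of the question:

```shell
# Hypothetical input: a word full of RE metacharacters ("c++").
out=$(printf '%s\n' 'I use c and c++ daily' |
awk -v word='c++' '
# Hypothetical helper: wrap the first literal occurrence of w in s.
# p is a local variable (awk convention: extra parameters after a gap).
function mark_first(s, w,    p) {
    p = index(s, w)              # plain string search, no RE involved
    if (p == 0) return s         # word not present: leave the line alone
    return substr(s, 1, p-1) "\\macro{" w "}" substr(s, p+length(w))
}
{ print mark_first($0, word) }')
printf '%s\n' "$out"
```

This prints `I use c and \macro{c++} daily`: the lone `c` is skipped because index() looks for the exact string `c++`.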

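Answer 3 (score: 1)

I'm still new to Awk, but this seems to work. Just watch out for things like "propane" when looking for "prop" (and you can't match on exact words only, because then "props" would not be changed to "\macro{prop}s"). You would need a better dictionary, and probably more than just Awk, to handle cases like that.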

NR==FNR {
    #Skip empty lines.
    if ($0 ~ /^$/)
        next;
    macros[$0] = "\\macro{"$0"}";
    next;
}
{
    for (name in macros) {
        n = name;
        #Sometimes a word may have a [ in it or other special chars.
        #Backslash-escape them so the word is matched literally.
        gsub(/[][\\.^$*+?(){}|]/, "\\\\&", n);
        if (sub(n, macros[name]))
            delete macros[name];
    }
    print;
}
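
The "propane" caveat mentioned above can be seen in a minimal sketch (the sample line is made up for illustration). A plain substring match marks the first place the letters occur, even inside an unrelated word:

```shell
# Hypothetical input: "prop" first appears inside "propane".
out=$(printf '%s\n' 'propane and props' |
awk '{
    # The first occurrence of "prop" is inside "propane", so that is what gets marked.
    sub("prop", "\\macro{&}")
    print
}')
printf '%s\n' "$out"
```

This prints `\macro{prop}ane and props`, so "props" is left unmarked.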