Text processing: finding multi-word phrases split across newlines

Date: 2015-10-03 12:59:32

Tags: regex replace sed

How can I find multi-word phrases that may be split across a newline, without removing the newlines?

For example:

The promotion and merchandise aided the success of We Are
the World and raised over $63 million for humanitarian
aid in Africa and the US.

Using sed (or any *nix text-processing tool, e.g. awk or perl), search for We Are the World and replace it with, say, <song title>, so that the text reads:

The promotion and merchandise aided the success of <song title>
and raised over $63 million for humanitarian
aid in Africa and the US.

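I have a bunch of search patterns (song titles). I want to search text snippets and replace every one of those patterns with <song title>. I don't want to remove the newlines.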

1 answer:

Answer 0: (score: 1)

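$ cat tst.awk
# Turn the title into a lowercase regex that accepts any run of
# whitespace (spaces, tabs, newlines) between its words.
BEGIN { gsub(/ +/,"[[:space:]]+",old); old = tolower(old) }
# Slurp the whole input into "tail", keeping every newline.
{ tail = tail $0 RS }
END {
    head = ""
    # Match case-insensitively against a lowercased copy of the text.
    while ( match(tolower(tail),old) ) {
        trgt = substr(tail,RSTART,RLENGTH)        # the matched title
        head = head substr(tail,1,RSTART-1) new   # text before it, plus the replacement
        tail = substr(tail,RSTART+RLENGTH)        # text still to be scanned
        if (trgt ~ RS) {
            # The title spanned a line break: re-emit the newline after the
            # replacement and drop blanks that would now start the next line.
            head = head RS
            sub(/^[[:blank:]]+/,"",tail)
        }
    }
    printf "%s%s", head, tail
}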

$ awk -v old='we are the world' -v new='<song title>' -f tst.awk file
The promotion and merchandise aided the success of <song title>
and raised over $63 million for humanitarian
aid in Africa and the US.
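
To see why a title split across lines still matches, it can help to print the pattern that the BEGIN block builds from `old`. The snippet below is a minimal sketch that repeats only that one step. The shell variable name `pattern` is an assumption made for illustration and is not part of the original script.

```shell
# Reproduce only the BEGIN-block transformation from tst.awk:
# every run of spaces becomes "[[:space:]]+", then the result is lowercased.
pattern=$(awk -v old='We Are the World' 'BEGIN {
    gsub(/ +/, "[[:space:]]+", old)
    print tolower(old)
}')
printf '%s\n' "$pattern"
# prints: we[[:space:]]+are[[:space:]]+the[[:space:]]+world
```

Because `[[:space:]]` also matches a newline, the words of the title may be separated by spaces, tabs, or line breaks. Note that the title is used as a regular expression, so a title containing characters such as `.` or `(` would have those treated as regex operators unless they are escaped first.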

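The script above assumes that, when a newline falls inside the old song title, you want the newline added after the new song title and any blank characters that followed the old title removed.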
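
The question mentions a whole list of song titles, while the script handles one title per run. One possible way to extend it, sketched here under assumptions, is to read the titles from a file first and then apply the same loop once per title. The file names `titles.txt` and `multi.awk`, the array `pats`, and the sample title `beat it` are hypothetical and chosen only for this example. The sketch also assumes the titles file is not empty.

```shell
work=$(mktemp -d)   # scratch directory for the sample files

# Hypothetical sample data: one title per line, plus the text from the question.
printf '%s\n' 'we are the world' 'beat it' > "$work/titles.txt"
printf '%s\n' 'The promotion and merchandise aided the success of We Are' \
              'the World and raised over $63 million for humanitarian' \
              'aid in Africa and the US.' > "$work/file"

cat > "$work/multi.awk" <<'EOF'
# First file (titles): build one lowercase, whitespace-tolerant regex per title.
NR == FNR {
    if ($0 != "") {                      # skip empty lines: an empty regex would loop forever
        pat = tolower($0)
        gsub(/ +/, "[[:space:]]+", pat)
        pats[++n] = pat
    }
    next
}
# Second file (text): slurp everything, keeping the newlines.
{ text = text $0 RS }
END {
    # Apply the same head/tail replacement loop as tst.awk, once per title.
    for (i = 1; i <= n; i++) {
        head = ""; tail = text
        while (match(tolower(tail), pats[i])) {
            trgt = substr(tail, RSTART, RLENGTH)
            head = head substr(tail, 1, RSTART-1) new
            tail = substr(tail, RSTART+RLENGTH)
            if (trgt ~ RS) {             # title spanned a line break
                head = head RS
                sub(/^[[:blank:]]+/, "", tail)
            }
        }
        text = head tail
    }
    printf "%s", text
}
EOF

out=$(awk -v new='<song title>' -f "$work/multi.awk" "$work/titles.txt" "$work/file")
printf '%s\n' "$out"
# prints:
# The promotion and merchandise aided the success of <song title>
# and raised over $63 million for humanitarian
# aid in Africa and the US.
```

Titles are applied in the order they appear in the titles file, so if one title is contained in another, listing the longer one first is probably the safer choice.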