Remove the first capitalized word after a period

Date: 2019-06-13 14:52:49

Tags: linux bash awk

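I want to drop the first capitalized word that follows a period. The goal is to drop the sentence-initial capitalized word even when there are two sentences on the same line. At the moment, as the example below shows, the first word of the line is omitted, but the first word of the second sentence still shows up in the output.

For the first sentence of each line, I solved the problem by starting the for loop at 2 instead of 1.

Here is the code: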

BEGIN { FS="[^[:alpha:]']+"; OFS=" "} 
{
   parola=" "
   max_nr=0

   prec=""

   for (i=2; i<=NF; i++) {
        if ($i ~ /[[:punct:][:digit:]]+[:space:]*[A-Z][']{0,1}[A-Z]{0,1}[a-z]+/){
            continue
        }
        else{
            if ($i ~ /[A-Z][']{0,1}[A-Z]{0,1}[a-z]+/){

                if(!(prec=="")){

                    prec=prec" "$i
                }
                else{
                    prec=$i              
                }
            }     
            else {

                if(!(prec=="")){

                    words[prec]
                    prec=""    
                  }
            }

            if (i==NF) {
                max_nr=max_nr+1  
                for (word1 in words) {
                    for (word2 in words) {
                        if (word1 != word2) {
                            print parola"" word1","word2
                        }
                    }

                    delete words[word1]
                }                
            }
            }
}  
}   
END{
    print FILENAME" "FNR
    print i
    print max_nr
}

This is the content of test.txt:

Today Jonathan played soccer with Martin. After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda. 
Paolo went to Lisbon with an Easyjet plane. During the trip he met two of his dear friends, Peter and John.

This is the command and its result:

  

awk -f script.awk test.txt> output.csv

Lisbon,During
Lisbon,John
Lisbon,Peter
Lisbon,Easyjet
During,John
During,Peter
During,Easyjet
John,Peter
John,Easyjet
Peter,Easyjet
Jonathan,Martin After
Jonathan,Lemon Soda
Jonathan,Martin
Martin After,Lemon Soda
Martin After,Martin
Lemon Soda,Martin

The expected output is:

Lisbon,John
Lisbon,Peter
Lisbon,Easyjet
John,Peter
John,Easyjet
Peter,Easyjet
Jonathan,Martin
Martin,Lemon Soda
Jonathan,Lemon Soda

Any suggestions?

2 answers:

Answer 0 (score: 2)

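Without trying to do the whole job for you (I provided a solution for that previously), this just addresses the specific problem you asked about in this question:

You are using FS="[^[:alpha:]']+", so you cannot tell whether the separator before any given field (a "word") was a . or something else. Use FS='[.]' or similar as a starting point instead. Then you know that the separator before each field is either the start of the line or a ., and you can use split($i,f,/[^[:alpha:]']+/) to break each field (a "sentence") into sub-fields ("words"). For example: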

$ cat tst.awk
BEGIN { FS="[[:space:]]*[.][[:space:]]*" }
{
    for (sentenceNr=1; sentenceNr<=NF; sentenceNr++) {
        sentence = $sentenceNr
        numWords = split(sentence,words,/[^[:alpha:]\047]+/)
        for (wordNr=2; wordNr<=numWords; wordNr++) {
            word = words[wordNr]
            if ( word ~ /^[[:upper:]]/ ) {
                print NR, sentenceNr, wordNr, word
            }
        }
    }
}

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John

Note that given the following input:

$ cat file
Today Jonathan played soccer with Martin. After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda.
Paolo went to Lisbon with an EasyJet plane. During the trip he met two of his dear friends, Peter and John.
May lost her home. 10 Downing St is where the PM lives.

the script above will output:

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John
3 2 2 Downing
3 2 3 St
3 2 7 PM

If "Downing" should not be reported there, change the code to:

$ cat tst.awk
BEGIN { FS="[[:space:]]*[.][[:space:]]*" }
{
    for (sentenceNr=1; sentenceNr<=NF; sentenceNr++) {
        numWords = split($sentenceNr,words,/[^[:alpha:]\047]+/)
        isSubsequent = 0
        for (wordNr=1; wordNr<=numWords; wordNr++) {
            word = words[wordNr]
            if ( word ~ /^[[:upper:]]/ ) {
                if ( isSubsequent++ ) {
                    print NR, sentenceNr, wordNr, word
                }
            }
        }
    }
}

$ awk -f tst.awk file
1 1 2 Jonathan
1 1 6 Martin
1 2 4 Martin
1 2 6 Jonathan
1 2 14 Lemon
1 2 15 Soda
2 1 4 Lisbon
2 1 7 EasyJet
2 2 11 Peter
2 2 13 John
3 2 3 St
3 2 7 PM
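
The scripts above only report where the capitalized words are. As a rough sketch of how the same sentence-splitting idea might be carried through to the comma-separated pairs the question asks for, something like the following could work. It is not part of the original answer and rests on several assumptions: the sentence-initial word is dropped by position (as in the first tst.awk), consecutive capitalized words are joined into one name ("Lemon Soda"), duplicates are dropped per input line, and pairs are printed in order of first appearance. The names pairs_per_line and keep_name are made up for illustration.

```shell
# pairs_per_line is a hypothetical helper name, not part of the original answer.
pairs_per_line() {
    awk '
    BEGIN { FS = "[[:space:]]*[.][[:space:]]*" }   # one field per sentence

    # Store the pending run of capitalized words, once per input line.
    function keep_name() {
        if (name != "" && !(name in seen)) {
            seen[name] = 1
            names[++n] = name
        }
        name = ""
    }

    {
        n = 0
        split("", seen)                            # reset per-line state
        for (s = 1; s <= NF; s++) {
            numWords = split($s, w, /[^[:alpha:]\047]+/)
            name = ""
            for (k = 2; k <= numWords; k++) {      # k=2 skips the sentence-initial word
                if (w[k] ~ /^[[:upper:]]/)
                    name = (name == "" ? w[k] : name " " w[k])
                else
                    keep_name()
            }
            keep_name()                            # flush a run that ends the sentence
        }
        # Print every unordered pair of names found on this line exactly once.
        for (i = 1; i < n; i++)
            for (j = i + 1; j <= n; j++)
                print names[i] "," names[j]
    }' "$@"
}
```

It would be called as pairs_per_line test.txt > output.csv. Because names are kept in an indexed array rather than walked with for (x in arr), the output order is deterministic, although the order of the two names within a pair can differ from the listing in the question.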

Answer 1 (score: 1)

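The following assumes that your text follows the most basic punctuation rule: a punctuation mark is followed by whitespace. If so, you can use GNU awk to extract the words you are interested in by defining record and field patterns. Assume a record is a sentence, ending in any of the characters .?!. A pattern that recognizes a capitalized word is [A-Z][a-z]*. Now it is easy: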

awk 'BEGIN{ RS="[.?!][[:space:]]*"; FPAT="([[:space:]]+[[:upper:]][[:alnum:]]*)+"}
     { print "record",NR,":",$0 }
     { for(i=1;i<=NF;++i) print "field",i,":",$i }' file

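Here the record separator RS is defined so that it also swallows any whitespace characters of the class [[:space:]] that follow the terminating punctuation. This ensures that the first word of a record is never preceded by whitespace, so it can never match the field pattern. All the other capitalized words are then picked up by the field pattern FPAT="([[:space:]]+[[:upper:]][[:alnum:]]*)+", which represents a sequence of capitalized words separated by any kind of whitespace. Note that a field therefore always starts with a space or a newline. This is easily cleaned up with a simple substitution.

The command above outputs: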

record 1 : Today Jonathan played soccer with Martin
field 1 :  Jonathan
field 2 :  Martin
record 2 : After the game, Martin and Jonathan were thirsty and then drank a fresh Lemon Soda
field 1 :  Martin
field 2 :  Jonathan
field 3 :  Lemon Soda
record 3 : Paolo went to Lisbon with an Easyjet plane
field 1 :  Lisbon
field 2 :  Easyjet
record 4 : During the trip he met two of his dear friends, Peter and John
field 1 :  Peter
field 2 :  John

This can now be adapted to the OP's problem (with the whitespace in the fields corrected):

awk 'BEGIN{ RS="[.?!][[:space:]]*"; FPAT="([[:space:]]+[[:upper:]][[:alnum:]]*)+"}
     { for (i=1;i<=NF;++i) { 
           w=$i; gsub(/[[:space:]]+/," ",w);
           w=substr(w,2); words[w]
       }
     }
     { for (w1 in words) { 
           for (w2 in words) if(w1 != w2) print w1,w2
           delete words[w1]
        }
     }' file

which returns:

Jonathan Martin
Jonathan Lemon Soda
Jonathan Martin
Lemon Soda Martin
Lisbon Easyjet
John Peter
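
The nested loop with delete words[w1] is what keeps each unordered pair from being printed twice: once w1 has been paired with every other key it is removed, so later iterations never see it again, and n words give n*(n-1)/2 lines. A minimal sketch of that idiom in isolation follows. The helper name unordered_pairs is hypothetical, the two words of each pair are put in a fixed order, and the result is piped through sort only because awk leaves the traversal order of for (k in arr) unspecified. It also relies on deleting the current key while looping, as the scripts above do, which the common awk implementations tolerate.

```shell
# unordered_pairs is a hypothetical helper: one word per input line,
# one "a,b" line per unordered pair of distinct words on output.
unordered_pairs() {
    awk '
    { words[$0] }                      # referencing the key creates it
    END {
        for (w1 in words) {
            for (w2 in words)
                if (w1 != w2) {
                    # Normalize the order inside the pair so output is stable.
                    pair = (w1 < w2) ? w1 "," w2 : w2 "," w1
                    print pair
                }
            delete words[w1]           # w1 is done, never pair it again
        }
    }' | sort
}
```

For example, printf 'Jonathan\nMartin\nLemon Soda\n' | unordered_pairs prints three lines, one for each pair.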