Create matching brackets - awk/sed

Time: 2016-03-21 10:12:59

Tags: regex awk sed

I have a dataset with three patterns:

First:

abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix

Second:

inaccurate in:prefix<>accurate:stem
inactive in:prefix<>active:stem

Third:

incommunicable in:prefix<>communicate:stem<>able:suffix
incompatibility in:prefix<>compatible:stem<>ity:suffix

I need to convert the above into the following form, with matching brackets in the style of the Penn Tree Bank (http://languagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf):

First:

abrasion ((abrade:stem) ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)

Second:

inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))

Third:

incommunicable (in:prefix ((communicate:stem)able:suffix))
incompatibility (in:prefix ((compatible:stem)ity:suffix))

This is the awk code I am using:

{
    n = gsub(/<>/,")",$2)
    s = sprintf("%*s",n,"")
    gsub(/ /,"(",s)
    print "(" $1, s "((" $2 "))"
}

Edit

A more complex form:

nationalistic national: stem <>ism:suffix<>ist:suffix<>ic:suffix 

should become:

nationalistic ((((national: stem) ism:suffix)ist:suffix)ic:suffix)

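My code does not produce the expected output shown in the examples.

4 answers:

Answer 0: (score: 1)

The expected output for pattern 1 may be wrong: the brackets are not paired. I guess that is a typo and it should be: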

abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)

I wrote this awk script:

awk -v d="<>" '{$2="("$2")"}
$1~/^ab/{sub(d,")",$2);$2="(" $2}
$1~/^ina/{sub(d,"(",$2);$2=$2")"}
$1~/^inc/{sub(d,"((",$2);sub(d,")",$2);$2=$2")"}7' file

With the examples for all three patterns in the same file, it gives:

abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
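
The pairing concern raised at the start of this answer can also be checked mechanically. Below is a minimal sketch, not part of the original answer: `check_parens` is a hypothetical helper name, and it assumes that only round brackets matter. It reports every line whose brackets close too early or are left open.

```shell
# Hypothetical helper: print the lines whose parentheses are not balanced.
# Reads the files given as arguments, or standard input if there are none.
check_parens() {
  awk '{
    depth = 0; bad = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "(") {
        depth++
      } else if (c == ")") {
        depth--
        if (depth < 0) { bad = 1; break }   # closed before it was opened
      }
    }
    # depth != 0 at the end means some bracket was never closed
    if (bad || depth != 0) print NR ": unbalanced: " $0
  }' "$@"
}
```

Piping the output of any of the scripts in this thread through `check_parens` should print nothing when every line is well formed.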

Answer 1: (score: 1)

This should be generic enough, since it matches on :stem, :prefix and :suffix:

awk 'BEGIN{FS=OFS="\n"}{
  a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
  b=gensub(/(\([a-zA-Z]*:stem\))<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
  c=gensub(/([a-zA-Z]*:prefix)<>(.*)/,"(\\1\\2)", "g", b);
  print c;}' testfile

Demo here: https://ideone.com/U3ux91

Edit

This should take care of multiple suffixes and prefixes:

awk 'BEGIN{FS=OFS="\n"}{
   a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
   while ( a ~ /stem\)<>.*:suffix/) {
     a=gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
   }
   while ( a ~ /<>/) {
     a=gensub(/([a-zA-Z]*?:prefix)<>(.*)/,"(\\1\\2)", "g", a);
   }
   print a;}' test

Demo here: https://ideone.com/U7LYXi (sorry if the word used there is not a real word, it is only for testing...)

Answer 2: (score: 1)

awk -F'<>| ' -v OFS= '{ 
    $1 = $1 " " 
    for (i=2; i<=NF; i++) { 
        if ($i ~ /prefix$/)    { $i = "(" $i; $NF = $NF ")" } 
        if ($i ~ /stem\)?$/)   { stem = i; $i = "(" $i ")" } 
        if ($i ~ /suffix\)?$/) { $i = $i ")"; $stem = "(" $stem } } 
    } { print }'
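
This answer comes without a sample run, so here is a hedged usage sketch. The awk program is the one above, only re-indented; the wrapper name `bracket_fields` is hypothetical, and the input assumes the second column contains no stray spaces (unlike the `national: stem <>...` line in the question's edit, which the space in the field separator would split differently).

```shell
# Hypothetical wrapper around the awk program from this answer.
bracket_fields() {
  awk -F'<>| ' -v OFS= '{
      $1 = $1 " "                      # keep one space after the word
      for (i = 2; i <= NF; i++) {
          # a prefix opens a bracket that closes at the end of the line
          if ($i ~ /prefix$/)    { $i = "(" $i; $NF = $NF ")" }
          # the stem gets its own pair; remember which field it is
          if ($i ~ /stem\)?$/)   { stem = i; $i = "(" $i ")" }
          # each suffix closes a bracket that opens in front of the stem
          if ($i ~ /suffix\)?$/) { $i = $i ")"; $stem = "(" $stem }
      }
      print
  }' "$@"
}

printf '%s\n' \
  'abrasion abrade:stem<>ion:suffix' \
  'inactive in:prefix<>active:stem' \
  'incommunicable in:prefix<>communicate:stem<>able:suffix' \
  'nationalistic national:stem<>ism:suffix<>ist:suffix<>ic:suffix' |
  bracket_fields
# abrasion ((abrade:stem)ion:suffix)
# inactive (in:prefix(active:stem))
# incommunicable (in:prefix((communicate:stem)able:suffix))
# nationalistic ((((national:stem)ism:suffix)ist:suffix)ic:suffix)
```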

Answer 3: (score: 0)

awk to the rescue!

$ awk 'function wrap(v) {return "("v")"; }
      {n=split($2,a,"<>"); 
       if(n==3) w=wrap(a[1] wrap(wrap(a[2]) a[3])); 
       else if(a[1]~/:prefix/) w=wrap(a[1] wrap(a[2])); 
       else w=wrap(wrap(a[1]) a[2]);
       print $1, w}' stems

abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
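
The script above handles exactly the three shapes from the question. A possible extension to any number of suffixes (as in the question's edit) is sketched below; it is not part of the original answer. It reuses the `wrap` function, while the shell function name `bracket_morphs` is hypothetical. It assumes one stem per word, no spaces inside the second column, and that suffixes attach to the stem before prefixes do, which is what the third pattern in the question suggests.

```shell
# Hypothetical generalisation: wrap the stem, then suffixes left to right,
# then prefixes from the one nearest the stem outwards.
bracket_morphs() {
  awk 'function wrap(v) { return "(" v ")" }
  {
    n = split($2, a, "<>")
    s = 0
    for (j = 1; j <= n; j++)
      if (a[j] ~ /:stem$/) s = j            # position of the stem
    if (s == 0) { print; next }             # no stem tag: leave line unchanged
    w = wrap(a[s])
    for (j = s + 1; j <= n; j++) w = wrap(w a[j])   # suffixes
    for (j = s - 1; j >= 1; j--) w = wrap(a[j] w)   # prefixes
    print $1, w
  }' "$@"
}

printf '%s\n' \
  'incompatibility in:prefix<>compatible:stem<>ity:suffix' \
  'nationalistic national:stem<>ism:suffix<>ist:suffix<>ic:suffix' |
  bracket_morphs
# incompatibility (in:prefix((compatible:stem)ity:suffix))
# nationalistic ((((national:stem)ism:suffix)ist:suffix)ic:suffix)
```

Looking the stem up by its tag, instead of branching on the number of parts, is what lets the same loop cover all three original patterns as well.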