I have a dataset with three patterns:
First:
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
Second:
inaccurate in:prefix<>accurate:stem
inactive in:prefix<>active:stem
Third:
incommunicable in:prefix<>communicate:stem<>able:suffix
incompatibility in:prefix<>compatible:stem<>ity:suffix
I need to convert the above into the following form, with bracketing in the style of the Penn Treebank (http://languagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf):
First:
abrasion ((abrade:stem) ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
Second:
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
Third:
incommunicable (in:prefix ((communicate:stem)able:suffix))
incompatibility (in:prefix ((compatible:stem)ity:suffix))
This is the awk code I am using:
{
n = gsub(/<>/,")",$2)
s = sprintf("%*s",n,"")
gsub(/ /,"(",s)
print "(" $1, s "((" $2 "))"
}
Edit
A more complex form, such as
nationalistic national:stem<>ism:suffix<>ist:suffix<>ic:suffix
should become:
nationalistic ((((national:stem)ism:suffix)ist:suffix)ic:suffix)
My code does not produce the expected output shown in the examples.
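Answer 0: (score: 1)
The expected output for pattern 1 may be wrong: the brackets are not paired. I am guessing this is a typo and that it should be: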
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
I wrote this awk script:
awk -v d="<>" '{$2="("$2")"}
$1~/^ab/{sub(d,")",$2);$2="(" $2}
$1~/^ina/{sub(d,"(",$2);$2=$2")"}
$1~/^inc/{sub(d,"((",$2);sub(d,")",$2);$2=$2")"}7' file
With the examples for all three patterns in the same file, it gives:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))
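The script above selects a rule from the first letters of the word (ab, ina, inc), so it only fits words that look like the samples. A minimal sketch of the same sub-based idea that chooses the rule from the :prefix and :suffix tags instead is shown here. The function name bracket_by_tag is hypothetical, and the sketch assumes each line has exactly one of the three shapes from the question.

```shell
# Hypothetical helper: same substitutions as above, but keyed on tags, not on the word.
bracket_by_tag() {
  awk -v d="<>" '{
    $2 = "(" $2 ")"                          # outer pair around the whole analysis
    if ($2 ~ /:prefix<>.*:suffix/) {         # prefix + stem + suffix
      sub(d, "((", $2); sub(d, ")", $2); $2 = $2 ")"
    } else if ($2 ~ /:prefix/) {             # prefix + stem
      sub(d, "(", $2); $2 = $2 ")"
    } else {                                 # stem + suffix
      sub(d, ")", $2); $2 = "(" $2
    }
    print
  }' "$@"
}

# Usage: reads the named file, or standard input when no file is given.
printf '%s\n' 'abstainer abstain:stem<>er:suffix' \
              'inactive in:prefix<>active:stem' | bracket_by_tag
# abstainer ((abstain:stem)er:suffix)
# inactive (in:prefix(active:stem))
```

Lines with more than one suffix or prefix are outside what this sketch handles.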
Answer 1: (score: 1)
This should be generic enough, because it matches on :stem, :prefix and :suffix:
awk 'BEGIN{FS=OFS="\n"}{
a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
b=gensub(/(\([a-zA-Z]*:stem\))<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
c=gensub(/([a-zA-Z]*:prefix)<>(.*)/,"(\\1\\2)", "g", b);
print c;}' testfile
Demo here: https://ideone.com/U3ux91
Edit
This should take care of multiple suffixes and prefixes:
awk 'BEGIN{FS=OFS="\n"}{
a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
while ( a ~ /stem\)<>.*:suffix/) {
a=gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
}
while ( a ~ /<>/) {
a=gensub(/([a-zA-Z]*?:prefix)<>(.*)/,"(\\1\\2)", "g", a);
}
print a;}' test
Demo here: https://ideone.com/U7LYXi (apologies if the test word is not a real word; it is only there for testing...)
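Note that gensub is a GNU awk extension, so the scripts above will not run under other awk implementations. As a rough alternative, the same nesting can be built with split and plain string concatenation, which any POSIX awk should accept. The function name bracket_fold is hypothetical, and the sketch assumes one :stem per line, with suffixes attached before prefixes as in the question's third pattern.

```shell
# Hypothetical helper: fold morphemes around the stem without gensub.
bracket_fold() {
  awk '{
    n = split($2, part, "<>")
    s = 0
    for (i = 1; i <= n; i++) if (part[i] ~ /:stem$/) s = i   # locate the stem
    if (s == 0) { print; next }              # no stem found: leave the line unchanged
    tree = "(" part[s] ")"
    for (i = s + 1; i <= n; i++) tree = "(" tree part[i] ")" # suffixes, innermost first
    for (i = s - 1; i >= 1; i--) tree = "(" part[i] tree ")" # prefixes, nearest the stem first
    print $1, tree
  }' "$@"
}

# Usage: reads the named file, or standard input when no file is given.
printf '%s\n' 'nationalistic national:stem<>ism:suffix<>ist:suffix<>ic:suffix' | bracket_fold
# nationalistic ((((national:stem)ism:suffix)ist:suffix)ic:suffix)
```

Because each suffix wraps the tree built so far, the number of suffixes or prefixes does not need to be known in advance.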
Answer 2: (score: 1)
awk -F'<>| ' -v OFS= '{
$1 = $1 " "
for (i=2; i<=NF; i++) {
if ($i ~ /prefix$/) { $i = "(" $i; $NF = $NF ")" }
if ($i ~ /stem\)?$/) { stem = i; $i = "(" $i ")" }
if ($i ~ /suffix\)?$/) { $i = $i ")"; $stem = "(" $stem } }
} { print }'
Answer 3: (score: 0)
awk to the rescue!
$ awk 'function wrap(v) {return "("v")"; }
{n=split($2,a,"<>");
if(n==3) w=wrap(a[1] wrap(wrap(a[2]) a[3]));
else if(a[1]~/:prefix/) w=wrap(a[1] wrap(a[2]));
else w=wrap(wrap(a[1]) a[2]);
print $1, w}' stems
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))