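I have a semicolon-delimited CSV file containing a list of adjectives. I need to extract the root and the suffixes of each one. Can this be done with AWK?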
Input file:
ypperlig;ypperlig;adj.;1
ypperlig;ypperlige;adj.;2
ypperlig;ypperligt;adj.;3
ypperlig;ypperligst;adj.;5
vunden;vunden;adj.;1
vunden;vundne;adj.;2
vunden;vundent;adj.;3
vunden;vundnest;adj.;5
Desired output file:
ypperlig,ypperlig,adj., ,e,t,*,st
vunden,vund,adj., ,ne,ent,*,nest
If an index number is missing from column 4, as in both of these examples, the empty slot must be filled with an asterisk.
hek2mgl's code:
BEGIN {
    FS=";"
}
{
    split($1,a,"")
    split($2,b,"")
    s=""
    for(i in a)
    {
        if(b[i]!=a[i])
        {
            break;
        }
        s = s "" a[i]
    }
    stem[$1]=s;
    type[$1] = $3
}
{
    suf[$1] = suf[$1] "," substr($2,length(stem[$1])+1)
}
END {
    for(i in stem)
    {
        printf "%s,%s, %s\n",i,stem[i],type[i],suf[i]
    }
}
Answer 0 (score: 2)
Probably yes, but it takes a more complex awk program:
script.awk:
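BEGIN {
    FS = ";"    # the sample input rows are semicolon-separated
}
# Get the stem and type through comparison between $1 and $2
# (only evaluated while no non-empty stem is stored for this word)
!stem[$1] {
    # Splitting on "" into single characters is not guaranteed by POSIX; gawk supports it
    n = split($1, a, "")
    split($2, b, "")
    s = ""
    # Walk the characters in index order; for (i in a) has no defined order
    for (i = 1; i <= n; i++) {
        if (b[i] != a[i]) {
            break
        }
        s = s a[i]
    }
    stem[$1] = s
    type[$1] = $3
}
# Get suffix from $2
{
    suf[$1] = suf[$1] "," substr($2, length(stem[$1]) + 1)
}
# Print
END {
    for (i in stem) {
        printf "%s,%s,%s, %s\n", i, stem[i], type[i], suf[i]
    }
}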
Call it like this:
awk -f script.awk input.file
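Note: awk does not preserve the input order here, because for (i in stem) visits the keys in an unspecified order. If you care about that, you can pipe the output through sort: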
awk -f script.awk input.file | sort
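The sample output in the question also needs a stem shared by all forms of a word and a `*` for every missing index, so a variant along the following lines may be closer to what was asked. This is only a sketch under stated assumptions, not the answerer's method: the input is semicolon-separated, column 4 holds a positive integer index, the stem is taken as the longest common prefix of all forms, and a form identical to the headword gets a blank suffix (a rule inferred from the sample output, not stated in the question). The names `extract_stems` and `common_prefix` are hypothetical.

```shell
#!/bin/sh
# extract_stems: hypothetical helper; reads rows of the form headword;form;type;index
extract_stems() {
  awk -F ';' '
    # Longest common prefix of two strings, using only substr and length
    function common_prefix(x, y,    n, i) {
      n = (length(x) < length(y)) ? length(x) : length(y)
      for (i = 1; i <= n; i++)
        if (substr(x, i, 1) != substr(y, i, 1))
          break
      return substr(x, 1, i - 1)
    }
    {
      idx = $4 + 0                      # force the index to a number
      if (!($1 in stem)) {              # first row for this headword
        order[++count] = $1             # remember input order
        stem[$1] = $2
        type[$1] = $3
      } else {
        stem[$1] = common_prefix(stem[$1], $2)
      }
      form[$1, idx] = $2                # keep each form under its index
      if (idx > max[$1]) max[$1] = idx
    }
    END {
      for (k = 1; k <= count; k++) {
        w = order[k]
        line = w "," stem[w] "," type[w]
        for (j = 1; j <= max[w]; j++) {
          if ((w, j) in form) {
            # Assumption: a form equal to the headword gets a blank suffix
            s = (form[w, j] == w) ? "" : substr(form[w, j], length(stem[w]) + 1)
            if (s == "") s = " "
          } else {
            s = "*"                     # index missing from column 4
          }
          line = line "," s
        }
        print line
      }
    }
  ' "$@"
}
```

With the function defined, `extract_stems input.file` prints one line per headword in input order, so no extra sort is needed. Suffixes are only cut in the END block because the stem can still shrink while later rows are being read.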