Parsing words into (root,suffix) in AWK

Time: 2015-12-27 16:54:04

Tags: bash awk

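I have a CSV file (semicolon-separated in the sample below) containing a list of adjectives. I need to extract the root and the suffix of each one. Can this be done with AWK?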

Input file:

ypperlig;ypperlig;adj.;1
ypperlig;ypperlige;adj.;2
ypperlig;ypperligt;adj.;3
ypperlig;ypperligst;adj.;5
vunden;vunden;adj.;1
vunden;vundne;adj.;2
vunden;vundent;adj.;3
vunden;vundnest;adj.;5

Required output file:

ypperlig,ypperlig,adj., ,e,t,*,st
vunden,vund,adj., ,ne,ent,*,nest

If a sequence number is missing from column 4, as number 4 is in both of these examples, its empty slot must be filled with an asterisk.

hek2mgl's code:

BEGIN{
FS=";"
}

{
split($1,a,"")
split($2,b,"")

s=""
for(i in a)
{ 
    if(b[i]!=a[i])
    {
        break;
    }
    s = s "" a[i]
}

    stem[$1]=s;
    type[$1] = $3
}

{
    suf[$1] = suf[$1] "," substr($2,length(stem[$1])+1)
}


END {
for(i in stem) 
{
    printf "%s,%s, %s\n",i,stem[i],type[i],suf[i]
}   
}

1 Answer:

Answer 0 (score: 2)

Probably yes, but it needs a somewhat more complex awk program:

script.awk

BEGIN{
    # The sample input in the question is semicolon-separated; use FS="," for comma-separated data
    FS=";"
}

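# Get the stem and type by comparing $1 and $2 character by character.
# This block runs only on the first row seen for each $1.
!stem[$1]{
    # Splitting on "" gives one character per element (a common extension, e.g. in gawk; not guaranteed by POSIX)
    n = split($1,a,"")
    split($2,b,"")

    s=""
    # Walk the indices in order; "for(i in a)" does not guarantee any order
    for(i=1; i<=n; i++){
        if(b[i]!=a[i]) {
            break
        }
        s = s a[i]
    }

    stem[$1] = s
    type[$1] = $3
}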

# Get suffix from $2
{
    suf[$1] = suf[$1] "," substr($2,length(stem[$1]) + 1)
}

# Print 
END {
    for(i in stem) {
        printf "%s,%s,%s, %s\n",i,stem[i],type[i],suf[i]
    }   
}

Call it like this:

awk -f script.awk input.file

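Note: the for(i in stem) loop in the END block visits keys in an unspecified order, so the output lines may not follow the input order. If you care about that, pipe the output through sort: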

awk -f script.awk input.file | sort
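
For comparison, here is a minimal sketch of one way to get the exact output format asked for in the question. It is not hek2mgl's script but an alternative written under a few assumptions: the input is semicolon-separated with the sequence number in column 4, the stem is the longest prefix shared by all forms of a word, the slot of the form that equals the base word is printed as a single space (as in the sample output), and missing sequence numbers up to the largest one seen become an asterisk. The function name stem_suffixes is hypothetical.

```shell
# Hypothetical helper: reads "word;form;type;number" rows and prints
# "word,stem,type,suffix1,suffix2,..." with one line per word.
stem_suffixes() {
    awk -F';' -v OFS=',' '
    # First row of a word: remember input order and start with the word itself as stem.
    !($1 in type) {
        order[++n] = $1
        type[$1]   = $3
        stem[$1]   = $1
    }
    {
        # Shrink the stem until it is a prefix of the current form.
        s = stem[$1]
        while (s != "" && substr($2, 1, length(s)) != s)
            s = substr(s, 1, length(s) - 1)
        stem[$1] = s

        form[$1, $4] = $2                 # form stored under (word, sequence number)
        if ($4 + 0 > max) max = $4 + 0    # largest sequence number seen
    }
    END {
        for (k = 1; k <= n; k++) {        # words in first-seen order, no sort needed
            w = order[k]
            line = w OFS stem[w] OFS type[w]
            for (j = 1; j <= max; j++) {
                if (!((w, j) in form))     cell = "*"   # missing sequence number
                else if (form[w, j] == w)  cell = " "   # base form: blank slot (assumption)
                else                       cell = substr(form[w, j], length(stem[w]) + 1)
                line = line OFS cell
            }
            print line
        }
    }' "$@"
}

# Example run on the rows from the question.
printf '%s\n' \
    'ypperlig;ypperlig;adj.;1' 'ypperlig;ypperlige;adj.;2' \
    'ypperlig;ypperligt;adj.;3' 'ypperlig;ypperligst;adj.;5' \
    'vunden;vunden;adj.;1' 'vunden;vundne;adj.;2' \
    'vunden;vundent;adj.;3' 'vunden;vundnest;adj.;5' | stem_suffixes
```

On the sample rows this prints the two lines listed as the required output. With a file, call it as stem_suffixes input.file. Because the words are kept in first-seen order, the output does not need to be piped through sort.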