Question

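I have a comma-delimited CSV file containing a list of adjectives. I need to extract the root and the suffix of each one. Is it possible to do that with AWK?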

Input file:

ypperlig;ypperlig;adj.;1
ypperlig;ypperlige;adj.;2
ypperlig;ypperligt;adj.;3
ypperlig;ypperligst;adj.;5
vunden;vunden;adj.;1
vunden;vundne;adj.;2
vunden;vundent;adj.;3
vunden;vundnest;adj.;5

Desired output file:

ypperlig,ypperlig,adj., ,e,t,*,st
vunden,vund,adj., ,ne,ent,*,nest

If a sequence number is missing in column 4, as in these two examples, the empty slot must be replaced by an asterisk.

hek2mgl's code:

BEGIN{
FS=";"
}

{
split($1,a,"")
split($2,b,"")

s=""
for(i in a)
{ 
    if(b[i]!=a[i])
    {
        break;
    }
    s = s "" a[i]
}

    stem[$1]=s;
    type[$1] = $3
}

{
    suf[$1] = suf[$1] "," substr($2,length(stem[$1])+1)
}


END {
for(i in stem) 
{
    printf "%s,%s, %s\n",i,stem[i],type[i],suf[i]
}   
}

Answer 1

Yes, it's possible, but it takes a somewhat more involved awk program:

script.awk:

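BEGIN {
    FS = ","    # set to ";" for semicolon-delimited input such as the sample in the question
}

# First line for each word: take the stem as the longest common
# prefix of $1 and $2, and remember the word type from $3.
# ($1 in stem) also works when the stem turns out to be empty.
!($1 in stem) {
    s = ""
    n = length($1)
    # Walk the characters in order; "for (i in array)" has no guaranteed order.
    for (i = 1; i <= n; i++) {
        c = substr($1, i, 1)
        if (substr($2, i, 1) != c) {
            break
        }
        s = s c
    }

    stem[$1] = s
    type[$1] = $3
}

# Every line: append the suffix, i.e. the part of $2 after the stem
{
    suf[$1] = suf[$1] "," substr($2, length(stem[$1]) + 1)
}

# Print one line per word (in no particular order)
END {
    for (w in stem) {
        printf "%s,%s,%s, %s\n", w, stem[w], type[w], suf[w]
    }
}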

Call it like this:

awk -f script.awk input.file

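Note: awk will not preserve the input order, because the iteration order of for (i in stem) is unspecified. If you care about that, you can pipe the output through sort: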

awk -f script.awk input.file | sort
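
The desired output in the question also asks for two things the script above does not attempt: a stem shared by all forms of a word (vund rather than vunden) and an asterisk for a missing sequence number. What follows is only a sketch of one way that might be done, not the original answer's method. It rests on several assumptions read off the sample data: the input is semicolon-delimited, column 4 holds a positive integer, a form identical to the headword is printed as a single blank, and slots are filled up to the highest sequence number seen for that word. The names stem_suffixes and lcp are made up for illustration.

```sh
#!/bin/sh
# Sketch: read "headword;form;type;number" lines on stdin and print
# "headword,stem,type,slot1,slot2,..." per headword, in input order.
stem_suffixes() {
    awk -F ';' '
    # Length of the longest common prefix of two strings.
    function lcp(a, b,    i) {
        i = 0
        while (i < length(a) && i < length(b) &&
               substr(a, i + 1, 1) == substr(b, i + 1, 1))
            i++
        return i
    }

    {
        if (!($1 in stem)) {        # first line for this headword
            order[++n] = $1         # remember the input order
            stem[$1] = $1           # start with the whole headword
            type[$1] = $3
        }
        # Shrink the stem to the part it shares with this form.
        stem[$1] = substr(stem[$1], 1, lcp(stem[$1], $2))
        form[$1, $4 + 0] = $2       # file the form under its sequence number
        if ($4 + 0 > max[$1]) max[$1] = $4 + 0
    }

    END {
        for (k = 1; k <= n; k++) {
            w = order[k]
            line = w "," stem[w] "," type[w]
            for (j = 1; j <= max[w]; j++) {
                if (!((w, j) in form))
                    cell = "*"      # missing sequence number
                else if (form[w, j] == w)
                    cell = " "      # assumption: the headword itself prints as a blank
                else
                    cell = substr(form[w, j], length(stem[w]) + 1)
                line = line "," cell
            }
            print line
        }
    }'
}
```

After sourcing the function, it would be called as stem_suffixes < input.file. Because the headwords are stored in the order they first appear, no extra sort step is needed. The suffixes are cut only in the END block, once the stem has been narrowed by every form of the word.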

Parse words into (root, suffix) in AWK
