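AWK: extracting stems and suffixes

Asked: 2015-12-30 12:16:51

Tags: unix awk

I have a semicolon-delimited CSV file. It contains a Danish dictionary from which I need to extract the stems and suffixes. I need to do this in AWK!

File: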

adelig;adelig;adj.;1
adelig;adelige;adj.;2
adelig;adeligt;adj.;3
adelig;adeligst;adj.;5
voksen;voksen;adj.;1
voksen;voksne;adj.;2
voksen;voksent;adj.;3
voksen;voksnest;adj.;5
virkemiddel;virkemiddel;sb.;1
virkemiddel;virkemidlet;sb.;2
virkemiddel;virkemidlets;sb.;3
virkemiddel;virkemiddels;sb.;4
virkemiddel;virkemidlerne;sb.;5
virkemiddel;virkemidlernes;sb.;6
virkemiddel;virkemiddel;sb.;7
virkemiddel;virkemidler;sb.;7
virkemiddel;virkemiddels;sb.;8
virkemiddel;virkemidlers;sb.;8

Expected output:

adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers

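The fourth column is the form number. When a form is missing, its suffix is replaced by an asterisk, as in adelig;adelig; ,e,t,*,st. When a form number is repeated, the suffixes for that number are separated by semicolons, as in virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers.

I started writing this code, but I could not get the algorithm to handle multiple possible stems, as in the case of virkemiddel:
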
BEGIN{
FS=";"
}

{

    lemm=$1;
    form=$2;

    if(match(form, lemm) > 0)
    {
        root=lemm;
        sub(root,"",form);
        suf[$1]=suf[$1]","form;
    }
    else
    {
        split($1,a,"");
        split($2,b,"");


        s="";
        for(i in a)
        { 
            if(b[i]!=a[i])
            {
                break;
            }
            s = s "" a[i];
        }
    }
    root=s;

}

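4 answers:

Answer 0 (score: 4)

Here is some awk code that finds the common prefix length and builds the list of suffixes. I have not handled missing forms or repeated numbers, but it should give you a start.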

#!/usr/bin/gawk -f

BEGIN { FS = OFS = ";" }
{ words[$1] = words[$1] FS $2 }
END {
    for (word in words) {
        sub("^"FS, "", words[word])
        num_words = split(words[word], these_words)
        prefix_length = common_prefix_length(these_words, num_words)

        suffixes = ""
        sep = ""
        for (i=1; i<=num_words; i++) {
            suffixes = suffixes sep substr(these_words[i],prefix_length+1)
            sep = ","
        }
        print word, substr(these_words[1], 1, prefix_length), suffixes
    }
}

function common_prefix_length(w, n                 ,i,j,minlen, char) {
    minlen = length(w[1])
    for (i=2; i<=n; i++) 
        if (length(w[i]) < minlen)
            minlen = length(w[i])

    for (i=1; i <= minlen; i++) {
        char = substr(w[1], i, 1)
        for (j=2; j <= n; j++)
            if (substr(w[j], i, 1) != char)
                return i-1
    }
    return minlen
}

With your input, the output is

voksen;voks;en,ne,ent,nest
virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
adelig;adelig;,e,t,st
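
A minimal sketch of how the two unhandled cases (missing forms and repeated form numbers) might be covered, written in Python for brevity rather than awk. It assumes form numbers start at 1, and the helper name format_entry is hypothetical. Unlike the expected output in the question, it prints the suffix of form 1 rather than leaving that slot blank.

```python
import os
from collections import defaultdict


def format_entry(lemma, forms):
    """Format one lemma; forms is a list of (word, form_number) pairs.

    Hypothetical helper: missing form numbers become "*", and suffixes
    that share a form number are joined with ";".
    """
    words = [word for word, _ in forms]
    # commonprefix compares character by character, so it works on plain words.
    stem = os.path.commonprefix(words)

    by_number = defaultdict(list)
    for word, number in forms:
        by_number[number].append(word[len(stem):])

    slots = []
    for number in range(1, max(by_number) + 1):
        if number in by_number:
            slots.append(";".join(by_number[number]))
        else:
            slots.append("*")  # no form with this number
    return "%s;%s;%s" % (lemma, stem, ",".join(slots))


adelig = [("adelig", 1), ("adelige", 2), ("adeligt", 3), ("adeligst", 5)]
print(format_entry("adelig", adelig))  # adelig;adelig;,e,t,*,st
```

Grouping the suffixes by form number first, then walking the range 1..max, avoids comparing neighbouring rows and also copes with gaps of more than one number.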

Answer 1 (score: 2)

This could be a good starting point in Python. It uses os.path.commonprefix to get the stem from a list of words.

import os
import csv

file="a"
prev_word=""
words=[]
data=dict()
csv_reader = csv.DictReader(
    open(file),
    delimiter=";",
    fieldnames=['common','word','type','num']
    )

for row in csv_reader:
    word = row['common']
    if prev_word and word != prev_word:
        # The lemma changed: store the finished group and start a new one.
        data[prev_word] = words
        words = []
    # Always keep the current form, including the first form of a new lemma.
    words.append(row['word'])
    prev_word = word

# Store the last group.
data[prev_word] = words
for word, values in data.items():
    common = os.path.commonprefix(values)
    suffixes = [i[len(common):] for i in values]
    # An empty suffix (form identical to the stem) is shown as '*'.
    suffixes = [i if len(i) else '*' for i in suffixes]
    print("%s;%s;%s" % (word, common, ','.join(suffixes)))

It returns:

adelig;adelig;*,e,t,st
voksen;voks;en,ne,ent,nest
virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
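
As a small illustration of why os.path.commonprefix fits here: despite living in os.path, it compares strings one character at a time rather than by path component, so it can be applied to ordinary words. The sketch below uses the voksen forms from the question; the variable names are only illustrative.

```python
import os

forms = ["voksen", "voksne", "voksent", "voksnest"]

# Longest leading run of characters shared by every form.
stem = os.path.commonprefix(forms)

# Whatever follows the stem in each form is its suffix.
suffixes = [form[len(stem):] for form in forms]

print(stem)      # voks
print(suffixes)  # ['en', 'ne', 'ent', 'nest']
```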

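Answer 2 (score: 2)

Three solutions in TXR. The first uses the extraction language to build an explicit, structure-based data model, then processes the structures: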

@(do
   (defstruct inflection ()
     word type index)

   (defstruct dict-entry ()
     root variants max-index))
@(collect :vars (dict))
@  (all)
@word;@(skip)
@  (and)
@    (collect :gap 0 :vars (infl))
@word;@variant;@type;@index
@      (bind infl @(new inflection word variant type type index (toint index)))
@    (end)
@    (bind dict @(new dict-entry root word variants infl
                      max-index [find-max infl > (usl index)].index))
@  (end)
@(end)
@(do (each ((d dict))
       (let* ((vs (mapcar (usl word) d.variants))
              (plen (or (pos-if (op < 1) (mapcar (opip uniq length)
                                                 (transpose vs)))
                        (length d.root)))
              (prefix [(first vs) 0..plen]))
         (put-string `@{d.root};@prefix; `)
         (each ((i (range 2 d.max-index)))
           (let ((vlist [keep-if (op eql i @1.index) d.variants]))
             (put-string
               (if (null vlist)
                 ",*"
                 `,@{(mapcar (ret `@{@1.word [plen..:]}`) vlist) ";"}`))))
         (put-line))))

Run:

$ txr stems.txr data
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers

Note the slight difference:

virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
                    ^

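This semicolon is not in the originally requested output; no rationale was given for leaving it out, so for now it is treated as a typo.

A brief interactive walkthrough of one selected piece: (pos-if (op < 1) (mapcar (opip uniq length) (transpose vs))). This is the logic that computes the length of the common prefix among the strings in the list vs: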

$ txr -i
1> (defvar vs '("catalog" "category" "catamaran" "catharsis"))
vs
2> vs
("catalog" "category" "catamaran" "catharsis")
3> (transpose vs)
("cccc" "aaaa" "tttt" "aeah" "lgma" "ooar" "grrs")
4> [mapcar uniq *3]
("c" "a" "t" "aeh" "lgma" "oar" "grs")
5> [mapcar length *4]
(1 1 1 3 4 3 3)
6> (pos-if (op < 1) *5)
3
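
For readers who do not use TXR, the same column-wise idea can be sketched in Python. This is only an illustration of the technique shown in the session above, not a translation of the TXR program; the function name common_prefix_length is an assumption.

```python
def common_prefix_length(words):
    """Return the length of the longest prefix shared by all strings."""
    # zip(*words) yields one tuple per character position and stops at the
    # shortest word, much like (transpose vs) in the session above.
    for position, column in enumerate(zip(*words)):
        # More than one distinct character in a column ends the prefix.
        if len(set(column)) > 1:
            return position
    # Every compared column was uniform: the shortest word is the prefix.
    return min(len(word) for word in words) if words else 0


vs = ["catalog", "category", "catamaran", "catharsis"]
print(common_prefix_length(vs))  # 3
```

When no column differs, the sketch falls back to the length of the shortest word, which plays the role of the (length d.root) fallback in the TXR code.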

Second version, without the data structures. It produces the same output on data:

@(repeat)
@  (all)
@word;@(skip)
@  (and)
@    (collect :gap 0)
@word;@variant;@type;@strindex
@      (bind index @(toint strindex))
@    (end)
@    (do
       (let* ((plen (or (pos-if (op < 1) (mapcar (opip uniq length)
                                                 (transpose variant)))
                        (length word)))
              (prefix [word 0..plen])
              (max-index [find-max index])
              (v-i-pairs (zip variant index)))
        (put-string `@word;@prefix; `)
        (each ((i (range 2 max-index)))
          (let ((vlist [keep-if (op eql i (second @1)) v-i-pairs]))
            (put-string
              `,@{(if vlist
                    (mapcar (aret `@{@1 [plen..:]}`) vlist)
                    '("*")) ";"}`)))
        (put-line)))
@  (end)
@(end)

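A pure TXR Lisp solution that does not use the extraction language. It is one big expression that reads the input lines, splits them, converts the fourth field to an integer, groups the entries by their root word, and so on: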

[(opip
   (get-lines)
   (mapcar (chain (op split-str @1 ";")
                  (ap list @1 @2 @3 (toint @4))))
   (partition-by first)
   (mapcar transpose)
   (mapdo (tb ((word variant type index))
            (let* ((root (first word))
                   (plen (or (pos-if (op < 1) (mapcar (opip uniq length)
                                                      (transpose variant)))
                         (length root)))
                   (prefix [root 0..plen])
                   (max-index [find-max index])
                   (v-i-pairs (zip variant index)))
              (put-string `@root;@prefix; `)
              (each ((i (range 2 max-index)))
                (let ((vlist [keep-if (op eql i (second @1)) v-i-pairs]))
                  (put-string
                    `,@{(if vlist
                          (mapcar (aret `@{@1 [plen..:]}`) vlist)
                          '("*")) ";"}`)))
              (put-line)))))]

Run:

$ txr stems3.tl < data
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers

Answer 3 (score: 0)

Here is the code I used to get the expected result. The comments in the code mark the main changes to glenn's code.

BEGIN {
FS=OFS=";"
}

{ 
    words[$1";"$3] = words[$1";"$3] FS $2;
    num[$1";"$3]=num[$1";"$3] $4 FS; #Array to store numbers in the fourth column by two ID's
}

END {
    for (item in words) {
        sub("^"FS, "", words[item]);
        words_n = split(words[item], extrac);
        split(num[item],numbers); #Extract numbers one by one, in order to compare them.
        split(item,cab,";");
        long = extract_stem(extrac, words_n);

        suffix = "";
        sep = ",";

        for (i=1; i<=words_n; i++)
        {
            suf=substr(extrac[i],long+1)
            if(suf!="") #Avoid null values from suffixes.
            {
                suffix = suffix sep suf;
            }

            if(numbers[i]==numbers[i+1]) #Compare numbers with the next number
            {
                sep=";";
            }
            else if((numbers[i+1]-numbers[i])!= 1) #Subtract numbers to its previous number
            {
                sep=",*,";
            }
            else
            {
                sep=",";
            }
        }
        print cab[1], substr(extrac[1], 1, long), " "suffix
    }
}


function extract_stem(wrd, nmr ,i,j,min, chr) { #This is the magic of glenn jackman!
    min = length(wrd[1])
    for (i=2; i<=nmr; i++)
    {
        if (length(wrd[i]) < min)
        {
            min = length(wrd[i]);
        }
    }

    for (i=1; i <= min; i++)
    {
        chr = substr(wrd[1], i, 1)
        for (j=2; j <= nmr; j++)
        {
            if (substr(wrd[j], i, 1) != chr)
            {
                return i-1;
            }
        }
    }
    return min
}

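I had to modify the code because there was a case I had not considered: the same lemma can occur with more than one part of speech, as with the sb. and vb. entries for abe below.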

abe;abe;sb.;1
abe;aben;sb.;2
abe;abens;sb.;3
abe;abes;sb.;4
abe;aberne;sb.;5
abe;abernes;sb.;6
abe;aber;sb.;7
abe;abers;sb.;8
abe;abe;vb.;1
abe;ab;vb.;2
abe;abet;vb.;3
abe;aber;vb.;4
abe;abede;vb.;6
abe;abes;vb.;7
abe;abedes;vb.;8
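
To show why the key combines the lemma with the third column, here is a minimal Python sketch of the same grouping step, assuming the four-column layout above. The function name group_forms and the SAMPLE string (a few of the abe rows) are only illustrative.

```python
import csv
import io

# A few of the abe rows from above, covering both parts of speech.
SAMPLE = """abe;abe;sb.;1
abe;aben;sb.;2
abe;abe;vb.;1
abe;ab;vb.;2
"""


def group_forms(lines):
    """Group (form, number) pairs under a (lemma, part of speech) key."""
    groups = {}
    for row in csv.reader(lines, delimiter=";"):
        if not row:
            continue  # skip blank lines
        lemma, form, pos, number = row
        groups.setdefault((lemma, pos), []).append((form, int(number)))
    return groups


groups = group_forms(io.StringIO(SAMPLE))
for (lemma, pos), forms in groups.items():
    print(lemma, pos, forms)
```

Keyed on the lemma alone, the sb. and vb. forms of abe would land in one list and their form numbers would collide; the compound key keeps them apart.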