I have a CSV file, separated by semicolons. The file contains a Danish dictionary, and I need to extract the stems and suffixes from it. I need to do this in AWK!
File:
adelig;adelig;adj.;1
adelig;adelige;adj.;2
adelig;adeligt;adj.;3
adelig;adeligst;adj.;5
voksen;voksen;adj.;1
voksen;voksne;adj.;2
voksen;voksent;adj.;3
voksen;voksnest;adj.;5
virkemiddel;virkemiddel;sb.;1
virkemiddel;virkemidlet;sb.;2
virkemiddel;virkemidlets;sb.;3
virkemiddel;virkemiddels;sb.;4
virkemiddel;virkemidlerne;sb.;5
virkemiddel;virkemidlernes;sb.;6
virkemiddel;virkemiddel;sb.;7
virkemiddel;virkemidler;sb.;7
virkemiddel;virkemiddels;sb.;8
virkemiddel;virkemidlers;sb.;8
Expected output:
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers
The fourth column is the form number. When a form is missing, its suffix is replaced with an asterisk, as in adelig;adelig; ,e,t,*,st
When a form number is repeated, the suffixes for that form are separated by semicolons instead of commas, as in virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers
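To pin the rule down: if the fourth-column numbers and the suffixes of one lemma were already collected into parallel arrays sorted by form number, the output should behave like the sketch below (join_suffixes is only an illustration of the rule with a hard-coded example, not a solution to the whole task):
#!/usr/bin/gawk -f
# Sketch of the gap/duplicate rule only; num and suf are parallel arrays of
# form numbers and suffixes for a single lemma, already sorted by form number.
function join_suffixes(num, suf, n,    i, out, expect, s) {
    expect = 1
    for (i = 1; i <= n; i++) {
        while (expect < num[i]) {           # a form number is missing: insert "*"
            out = out ",*"
            expect++
        }
        s = (num[i] == 1) ? " " : suf[i]    # form 1 is the base form, shown as a blank
        if (i > 1 && num[i] == num[i-1])    # repeated form number: separate with ";"
            out = out ";" s
        else
            out = out "," s
        if (num[i] >= expect)
            expect = num[i] + 1
    }
    return substr(out, 2)                   # drop the leading ","
}
BEGIN {
    # Worked example: adelig has forms 1, 2, 3, 5 with suffixes "", "e", "t", "st"
    num[1] = 1; suf[1] = ""
    num[2] = 2; suf[2] = "e"
    num[3] = 3; suf[3] = "t"
    num[4] = 5; suf[4] = "st"
    print "adelig;adelig;" join_suffixes(num, suf, 4)   # prints: adelig;adelig; ,e,t,*,st
}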
I started writing this code, but I could not get the algorithm to handle more than one possible stem, as in the case of virkemiddel:
BEGIN {
    FS = ";"
}
{
    lemm = $1;
    form = $2;
    if (match(form, lemm) > 0) {
        root = lemm;
        sub(root, "", form);
        suf[$1] = suf[$1] "," form;
    }
    else {
        split($1, a, "");
        split($2, b, "");
        s = "";
        for (i in a) {
            if (b[i] != a[i]) {
                break;
            }
            s = s "" a[i];
        }
    }
    root = s;
}
Answer 0 (score: 4)
Here is some awk code that finds the common prefix length and determines the list of suffixes. I have not handled missing forms or repeated numbers, but it should give you a start.
#!/usr/bin/gawk -f
BEGIN { FS = OFS = ";" }
{ words[$1] = words[$1] FS $2 }
END {
    for (word in words) {
        sub("^"FS, "", words[word])
        num_words = split(words[word], these_words)
        prefix_length = common_prefix_length(these_words, num_words)
        suffixes = ""
        sep = ""
        for (i=1; i<=num_words; i++) {
            suffixes = suffixes sep substr(these_words[i], prefix_length+1)
            sep = ","
        }
        print word, substr(these_words[1], 1, prefix_length), suffixes
    }
}
function common_prefix_length(w, n,    i, j, minlen, char) {
    minlen = length(w[1])
    for (i=2; i<=n; i++)
        if (length(w[i]) < minlen)
            minlen = length(w[i])
    for (i=1; i <= minlen; i++) {
        char = substr(w[1], i, 1)
        for (j=2; j <= n; j++)
            if (substr(w[j], i, 1) != char)
                return i-1
    }
    return minlen
}
With your input, the output is
voksen;voks;en,ne,ent,nest
virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
adelig;adelig;,e,t,st
Answer 1 (score: 2)
This might be a good starting point in Python. It uses os.path.commonprefix to obtain the stem from a list of words.
import os
import csv

file = "a"
prev_word = ""
words = []
data = dict()

csv_reader = csv.DictReader(
    open(file),
    delimiter=";",
    fieldnames=['common', 'word', 'type', 'num']
)

for row in csv_reader:
    word = row['common']
    if not prev_word or word == prev_word:
        words.append(row['word'])
    else:
        common = os.path.commonprefix(words)
        data[prev_word] = words
        words = []
    prev_word = word
data[prev_word] = words

for word, values in data.iteritems():
    common = os.path.commonprefix(values)
    suffixes = [i[len(common):] for i in values]
    suffixes = [i if len(i) else '*' for i in suffixes]
    print "%s;%s;%s" % (word, common, ','.join(suffixes))
It returns:
voksen;voks;ne,ent,nest
virkemiddel;virkemid;let,lets,dels,lerne,lernes,del,ler,dels,lers
adelig;adelig;*,e,t,st
Answer 2 (score: 2)
Three solutions in TXR. First, an explicit structure-based data model is built using the extraction language, and then the structures are processed:
@(do (defstruct inflection ()
       word type index)
     (defstruct dict-entry ()
       root variants max-index))
@(collect :vars (dict))
@ (all)
@word;@(skip)
@ (and)
@ (collect :gap 0 :vars (infl))
@word;@variant;@type;@index
@ (bind infl @(new inflection word variant type type index (toint index)))
@ (end)
@ (bind dict @(new dict-entry root word variants infl
                   max-index [find-max infl > (usl index)].index))
@ (end)
@(end)
@(do (each ((d dict))
       (let* ((vs (mapcar (usl word) d.variants))
              (plen (or (pos-if (op < 1)
                                (mapcar (opip uniq length) (transpose vs)))
                        (length d.root)))
              (prefix [(first vs) 0..plen]))
         (put-string `@{d.root};@prefix; `)
         (each ((i (range 2 d.max-index)))
           (let ((vlist [keep-if (op eql i @1.index) d.variants]))
             (put-string (if (null vlist)
                             ",*"
                             `,@{(mapcar (ret `@{@1.word [plen..:]}`) vlist) ";"}`))))
         (put-line))))
Run:
$ txr stems.txr data
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
Note the slight difference:
virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
                    ^
This semicolon is not present in the original desired output; no reason was given for omitting it, so for now it is treated as a typographical error.
A brief interactive lecture on the selected material: (pos-if (op < 1) (mapcar (opip uniq length) (transpose vs))). This is the logic that computes the common prefix length among the strings in the list vs:
$ txr -i
1> (defvar vs '("catalog" "category" "catamaran" "catharsis"))
vs
2> vs
("catalog" "category" "catamaran" "catharsis")
3> (transpose vs)
("cccc" "aaaa" "tttt" "aeah" "lgma" "ooar" "grrs")
4> [mapcar uniq *3]
("c" "a" "t" "aeh" "lgma" "oar" "grs")
5> [mapcar length *4]
(1 1 1 3 4 3 3)
6> (pos-if (op < 1) *5)
3
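The same column-wise idea can also be expressed in gawk; the following is only a rough sketch of the prefix-length step, reusing the common_prefix_length name from the first answer but tracking a set of seen characters per column instead of a single comparison character:
#!/usr/bin/gawk -f
# Sketch: common prefix length by scanning the words column by column and
# stopping at the first position that holds more than one distinct character.
function common_prefix_length(w, n,    i, j, c, seen, distinct, minlen) {
    # the shortest word bounds how far the common prefix can reach
    minlen = length(w[1])
    for (j = 2; j <= n; j++)
        if (length(w[j]) < minlen)
            minlen = length(w[j])
    for (i = 1; i <= minlen; i++) {
        split("", seen)                     # clear the set of characters for this column
        distinct = 0
        for (j = 1; j <= n; j++) {          # "column" i across all words
            c = substr(w[j], i, 1)
            if (!(c in seen)) { seen[c] = 1; distinct++ }
        }
        if (distinct > 1)                   # first non-uniform column ends the prefix
            return i - 1
    }
    return minlen
}
BEGIN {
    w[1] = "catalog"; w[2] = "category"; w[3] = "catamaran"; w[4] = "catharsis"
    print common_prefix_length(w, 4)        # prints 3, as in the session above
}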
A second version, without data structures. It produces the same output on data:
@(repeat)
@ (all)
@word;@(skip)
@ (and)
@ (collect :gap 0)
@word;@variant;@type;@strindex
@ (bind index @(toint strindex))
@ (end)
@ (do (let* ((plen (or (pos-if (op < 1)
                               (mapcar (opip uniq length) (transpose variant)))
                       (length word)))
             (prefix [word 0..plen])
             (max-index [find-max index])
             (v-i-pairs (zip variant index)))
        (put-string `@word;@prefix; `)
        (each ((i (range 2 max-index)))
          (let ((vlist [keep-if (op eql i (second @1)) v-i-pairs]))
            (put-string `,@{(if vlist (mapcar (aret `@{@1 [plen..:]}`) vlist) '("*")) ";"}`)))
        (put-line)))
@ (end)
@(end)
A pure TXR Lisp solution, not using the extraction language. One big expression that reads the input lines, splits them apart, converts the fourth field to an integer, groups the entries by their root word, and so on:
[(opip (get-lines)
       (mapcar (chain (op split-str @1 ";")
                      (ap list @1 @2 @3 (toint @4))))
       (partition-by first)
       (mapcar transpose)
       (mapdo (tb ((word variant type index))
                (let* ((root (first word))
                       (plen (or (pos-if (op < 1)
                                         (mapcar (opip uniq length) (transpose variant)))
                                 (length root)))
                       (prefix [root 0..plen])
                       (max-index [find-max index])
                       (v-i-pairs (zip variant index)))
                  (put-string `@root;@prefix; `)
                  (each ((i (range 2 max-index)))
                    (let ((vlist [keep-if (op eql i (second @1)) v-i-pairs]))
                      (put-string `,@{(if vlist (mapcar (aret `@{@1 [plen..:]}`) vlist) '("*")) ";"}`)))
                  (put-line)))))]
Run:
$ txr stems3.tl < data
adelig;adelig; ,e,t,*,st
voksen;voks; ,ne,ent,*,nest
virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
Answer 3 (score: 0)
Here is the code with which I got the expected result. The comments in the code point out the main changes relative to glenn's code.
BEGIN {
    FS = OFS = ";"
}
{
    words[$1";"$3] = words[$1";"$3] FS $2;
    num[$1";"$3] = num[$1";"$3] $4 FS;    # Also store the fourth-column form numbers, keyed by lemma and word class
}
END {
    for (item in words) {
        sub("^"FS, "", words[item]);
        words_n = split(words[item], extrac);
        split(num[item], numbers);        # Split the stored form numbers so they can be compared one by one
        split(item, cab, ";");
        long = extract_stem(extrac, words_n);
        suffix = "";
        sep = ",";
        for (i=1; i<=words_n; i++) {
            suf = substr(extrac[i], long+1)
            if (suf != "") {                               # Skip empty suffixes (the base form)
                suffix = suffix sep suf;
            }
            if (numbers[i] == numbers[i+1]) {              # Same form number repeated next: join with ";"
                sep = ";";
            } else if ((numbers[i+1] - numbers[i]) != 1) { # Gap in the form numbers: insert ",*," before the next suffix
                sep = ",*,";
            } else {
                sep = ",";
            }
        }
        print cab[1], substr(extrac[1], 1, long), " "suffix
    }
}
function extract_stem(wrd, nmr,    i, j, min, chr) {       # This is the magic of glenn jackman!
    min = length(wrd[1])
    for (i=2; i<=nmr; i++) {
        if (length(wrd[i]) < min) {
            min = length(wrd[i]);
        }
    }
    for (i=1; i <= min; i++) {
        chr = substr(wrd[1], i, 1)
        for (j=2; j <= nmr; j++) {
            if (substr(wrd[j], i, 1) != chr) {
                return i-1;
            }
        }
    }
    return min
}
I had to modify the code; I had not considered this case: when the same lemma occurs in more than one word class, as with the noun and verb abe:
abe;abe;sb.;1
abe;aben;sb.;2
abe;abens;sb.;3
abe;abes;sb.;4
abe;aberne;sb.;5
abe;abernes;sb.;6
abe;aber;sb.;7
abe;abers;sb.;8
abe;abe;vb.;1
abe;ab;vb.;2
abe;abet;vb.;3
abe;aber;vb.;4
abe;abede;vb.;6
abe;abes;vb.;7
abe;abedes;vb.;8
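The change that copes with this is indexing the arrays by lemma and word class together ($1";"$3), so the noun and verb paradigms of abe stay separate. In isolation, that grouping step might look like the following sketch (file stands in for the CSV name; for (key in ...) visits the groups in no particular order):
awk -F';' '{ forms[$1 ";" $3] = forms[$1 ";" $3] "," $2 }
           END { for (key in forms) print key " -> " substr(forms[key], 2) }' file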