I have a large text (100,000 words) to parse, and it has the following format:
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix
I need to convert it to this format:
abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) ist:suffix))
adaptable (S (adapt:stem) able:suffix))
addiction (S (addict:stem) ion:suffix))
The awk code I am using is:
awk 'BEGIN{FS=OFS="\n"}{
a=gensub(/([a-zA-Z]*):stem/,"( S\\1:stem)", "g");
while ( a ~ /stem)<>.*:suffix/) {
a=gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
}
while ( a ~ /<>/) {
a=gensub(/([a-zA-Z]*?:prefix)<>(.*)/,"(S\\1\\2)", "g", a);
}
print a;}
This code fails to produce the desired output and only generates results for 5 tokens.
Answer 0 (score: 1)
Take a look at this:
#!/usr/bin/perl
# provide data
$t = <<'EOT';
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix
EOT
# iterate over lines
foreach $line (split /\n/, $t) {
    # split the line into the word and its morpheme description
    ($word, $def) = split /\s+/, $line, 2;
    @parts = split /<>/, $def;
    # loop over attributes
    $new = '';
    for ($pos = 0; $pos < $#parts; $pos++) {
        $new =
            $new eq '' ?
            qq[(S ($parts[0]) ($parts[1]))] :    # create new entry
            qq[(S $new ($parts[$pos + 1]))];     # wrap existing entry with the next part
    }
    # output
    print qq($word $new\n);
}
which produces:
abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) (ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) (ist:suffix))
adaptable (S (adapt:stem) (able:suffix))
addiction (S (addict:stem) (ion:suffix))
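The grouping for accessibility may need to be the other way round, but I cannot tell, because your example appears to be syntactically incorrect. If that is the case, you have to loop from $#parts-1 down to 0.
Or perhaps it is enough to group all stems and all suffixes into S() groups.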
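The reversed grouping mentioned above can be sketched in awk, the language of the question. This is only an illustration under assumptions: the function name to_right_nested is hypothetical, and the right-nested shape is one reading of "the other way round", not something the question confirms.

```shell
# Hypothetical helper: right-nested grouping. The last part is taken first
# and the tree is wrapped from right to left, towards the stem.
to_right_nested() {
  awk '{
    n = split($2, parts, "<>")        # parts[1..n]: stem followed by suffixes
    tree = "(" parts[n] ")"           # start from the last part
    for (i = n - 1; i >= 1; i--)      # walk back towards the stem
      tree = "(S (" parts[i] ") " tree ")"
    print $1, tree
  }'
}

printf '%s\n' 'accessibility access:stem<>ible:suffix<>ity:suffix' | to_right_nested
```

For words with a single suffix the result is the same as the left-nested form; only entries with two or more suffixes, such as accessibility, come out differently.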
Answer 1 (score: 1)
use v5.10;
use strict;
while( my $line = <> )
{
    chomp $line;
    if( $line =~ /^(\w+)\s+(.+)/ )
    {
        my $word  = $1;
        my @stems = split '<>', $2;
        if( @stems )
        {
            # start with the stem, then wrap once per remaining suffix
            my $stems = sprintf '(%s)', shift @stems;
            while( @stems )
            {
                $stems = sprintf '(S %s (%s))', $stems, shift @stems;
            }
            say "$word $stems";
        }
    }
}
Answer 2 (score: 1)
Although the example seems to be incorrect, here is my attempt at an awk solution:
cat >infile.txt <<TXT
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix
TXT
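awk '
# p1 and p2 are extra parameters so that they are local to each call
function proc(desc,    p1, p2) {
    # greedy match up to and including the last "<>"
    if (match(desc, /^.*<>/)) {
        p1 = substr(desc, 1, RLENGTH - 2);   # everything before the last "<>"
        p2 = substr(desc, RLENGTH + 1);      # the final part
        return "S (" proc(p1) ") (" p2 ")";
    }
    return desc;
}
{
    print $1, "(" proc($2) ")"
}
' infile.txt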
Output:
abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) (ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) (ist:suffix))
adaptable (S (adapt:stem) (able:suffix))
addiction (S (addict:stem) (ion:suffix))
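The code calls the recursive function proc on the second field. It finds the last occurrence of '<>', then formats the string, calling itself again on the first part. That's it. The only trick is adding the variables p1 and p2 to proc's parameter list so that they are truly local.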
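Splitting at the last '<>' and recursing on the left part amounts to folding the parts from left to right, so the same nesting can also be built with a plain loop. The following is a minimal sketch of that equivalent, assuming any POSIX awk; the function name to_left_nested is hypothetical and not part of the answer above.

```shell
# Hypothetical helper: iterative equivalent of the recursive proc.
to_left_nested() {
  awk '{
    n = split($2, parts, "<>")        # parts[1..n]: stem followed by suffixes
    tree = parts[1]                   # start with the bare stem
    for (i = 2; i <= n; i++)          # wrap once per suffix, left to right
      tree = "S (" tree ") (" parts[i] ")"
    print $1, "(" tree ")"
  }'
}

printf '%s\n' 'accessibility access:stem<>ible:suffix<>ity:suffix' | to_left_nested
```

Because the loop keeps its state in ordinary variables that are reset on every input line, it does not need the local-variable trick that the recursive version relies on.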
Answer 3 (score: 1)
Here is a possible awk solution:
{
    a = gensub(/([a-zA-Z]*:stem)<>([a-zA-Z]*:suffix)/,"(S (\\1) (\\2))", "1")
    while ( a ~ /<>[a-zA-Z]*:suffix/) {
        a = gensub(/(\(S.*)<>([a-zA-Z]*:suffix)/,"(S \\1 (\\2))", "1", a)
    }
    print a
}
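Answer 4 (score: 1)
I think this Perl program does what you need.
The data sample is really too short, and you don't explain the mismatched parentheses in the desired output for activist, adaptable and addiction, but I have coded the pattern that I can see.
I trust you are able to open a file in Perl? If you supply the path to the input file as a parameter on the command line, then you only need to change <DATA> to <>. Output is sent to STDOUT, so if you want to store it in a file, just redirect the output on the command line.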
use strict;
use warnings 'all';
while ( <DATA> ) {
    my ($word, $ss) = split;
    my @ss = split /<>/, $ss;
    # repeatedly merge the first two items into one S group
    while ( @ss > 1 ) {
        my $s = sprintf 'S (%s) (%s)', @ss[0,1];
        splice @ss, 0, 2, $s;
    }
    printf "%s (%s)\n", $word, $ss[0];
}
__DATA__
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix
Output:
abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) (ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) (ist:suffix))
adaptable (S (adapt:stem) (able:suffix))
addiction (S (addict:stem) (ion:suffix))