Printing a text file in a pattern

Date: 2016-03-22 12:58:28

Tags: regex perl awk sed

I have a large text (100,000 words) to parse, and it has the following format

abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix

I need to convert it to this format

abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) ist:suffix))
adaptable (S (adapt:stem) able:suffix))
addiction (S (addict:stem) ion:suffix))

The awk code I am using is

awk 'BEGIN{FS=OFS="\n"}{
   a=gensub(/([a-zA-Z]*):stem/,"( S\\1:stem)", "g");
   while ( a ~ /stem)<>.*:suffix/) {
     a=gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
   }
   while ( a ~ /<>/) {
     a=gensub(/([a-zA-Z]*?:prefix)<>(.*)/,"(S\\1\\2)", "g", a);
   }
   print a;}

This code does not produce the desired output; it only produces results for 5 tokens.

5 answers:

Answer 0: (score: 1)

Have a look at this:

#!/usr/bin/perl

# provide data
$t = <<'EOT';
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix
EOT

# iterate over lines
foreach $line (split /\n/, $t) {

    # split the line
    ($word, $def) = split /\s+/, $line, 2;
    @parts = split /\<\>/, $def;

    # loop over attributes
    $new = '';
    for ($pos = 0; $pos<$#parts; $pos++) {
            $new = 
                $new eq '' ?
                qq[(S ($parts[0]) ($parts[1]))] :      # create new entry
                qq[(S $new ($parts[$pos + 1]))];       # wrap existing entry with the next part
    }

    # output
    print qq($word $new\n);
}

produces

abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) (ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) (ist:suffix))
adaptable (S (adapt:stem) (able:suffix))
addiction (S (addict:stem) (ion:suffix))

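The grouping for "accessibility" might need to be the other way around, but I cannot verify that because your example appears to be syntactically incorrect.

If that is the case, you have to loop from $#parts-1 down to 0.

Or perhaps all stems and all suffixes are simply meant to be grouped into S() groups.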
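
To make the "other way around" grouping concrete, here is a minimal awk sketch of the right-nested variant, which walks the parts from the last one down to the first. The helper name nest_right is hypothetical, and whether this grouping is what the question wants is only an assumption, since the sample output does not settle it.

```shell
# nest_right is a hypothetical helper name, not taken from the answer above.
# Reads "word part<>part<>..." lines on stdin and nests the parts to the right.
nest_right() {
  awk '{
    n = split($2, parts, "<>")
    # Start from the last part and wrap leftwards.
    out = "(" parts[n] ")"
    for (i = n - 1; i >= 1; i--)
      out = "(S (" parts[i] ") " out ")"
    print $1, out
  }'
}

printf '%s\n' 'accessibility access:stem<>ible:suffix<>ity:suffix' | nest_right
# prints: accessibility (S (access:stem) (S (ible:suffix) (ity:suffix)))
```

For entries with only two parts, the result is the same as with the left-nested loop; the two variants differ only when there are three or more parts.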

Answer 1: (score: 1)

use v5.10;
use strict;

while( my $line = <>)
{
    chomp $line;
    if( $line =~ /^(\w+)\s+(.+)/)
    {
        my $word = $1;
        my @stems = split '<>', $2;

        if( @stems )
        {
            my $stems = sprintf '(%s)', shift @stems;
            while( @stems )
            {
                $stems = sprintf '(S %s (%s))', $stems, shift @stems;
            }
            say "$word $stems";
        }
    }
}

Answer 2: (score: 1)

Although the example appears to be incorrect, I have tried to provide a solution:

cat >infile.txt <<TXT
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix
TXT

awk '
function proc(desc,    p1, p2) { 
  if (match(desc, /^.*<>/)) {  # greedy match ends at the last "<>"
    p1 = substr(desc, 1, RLENGTH - 2);
    p2 = substr(desc, RLENGTH + 1);
    return "S (" proc(p1) ") ("p2")";
  } 

  return desc;
}

{
  print $1, "(" proc($2) ")"
}
' infile.txt

Output:

abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) (ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) (ist:suffix))
adaptable (S (adapt:stem) (able:suffix))
addiction (S (addict:stem) (ion:suffix))

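The code calls the recursive function proc on the second field. It finds the last occurrence of '<>', then formats the string, calling itself again on the first part. That's all. The only trick is adding the variables p1 and p2 to proc's parameter list so that they are truly local.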
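
As a small illustration of that last point, the sketch below contrasts a function that assigns to an undeclared variable with one that lists it as an extra parameter. The names awk_locals_demo, leaky and tidy are hypothetical and exist only for this demonstration. In awk, any variable that is not a parameter is global, so without the extra parameters a recursive call to proc could overwrite p2 before the caller has used it.

```shell
# Hypothetical demo, not part of the answer: shows why the extra parameter matters.
awk_locals_demo() {
  awk '
    # tmp is not in the parameter list, so this assignment writes a global.
    function leaky(x)       { tmp = x * 2; return tmp }
    # tmp is an extra, unused parameter, so it is local to each call.
    function tidy(x,   tmp) { tmp = x * 2; return tmp }
    BEGIN {
      tmp = "untouched"
      tidy(21);  print "after tidy: " tmp
      leaky(21); print "after leaky: " tmp
    }'
}

awk_locals_demo
# prints:
# after tidy: untouched
# after leaky: 42
```

The extra spaces before tmp in the parameter list are only a convention that separates real arguments from locals; awk itself does not require them.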

Answer 3: (score: 1)

Here is a possible awk solution:

{
    a = gensub(/([a-zA-Z]*:stem)<>([a-zA-Z]*:suffix)/,"(S (\\1) (\\2))", "1")
    while ( a ~ /<>[a-zA-Z]*:suffix/) {
        a = gensub(/(\(S.*)<>([a-zA-Z]*:suffix)/,"(S \\1 (\\2))", "1", a)
    }
    print a
}

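Answer 4: (score: 1)

I think this Perl program does what you want.

The data sample is really too short, and you haven't explained the mismatched parentheses in the desired output for activist, adaptable, and addiction, but I have coded the pattern as I see it.

I trust you are able to open a file in Perl? If you supply the path to the input file as a parameter on the command line, you only need to change <DATA> to <>. The output is sent to STDOUT, so if you want to store it in a file, just redirect the output on the command line.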

use strict;
use warnings 'all';

while ( <DATA> ) {
    my ($word, $ss) = split;
    my @ss = split /<>/, $ss;

    while ( @ss > 1 ) {
        my $s = sprintf 'S (%s) (%s)', @ss[0,1];
        splice @ss, 0, 2, $s;
    }

    printf "%s (%s)\n", $word, $ss[0];
}


__DATA__
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
accessibility access:stem<>ible:suffix<>ity:suffix
accretion accrete:stem<>ion:suffix
activist active:stem<>ist:suffix
adaptable adapt:stem<>able:suffix
addiction addict:stem<>ion:suffix

Output

abrasion (S (abrade:stem) (ion:suffix))
abstainer (S (abstain:stem) (er:suffix))
abstention (S (abstain:stem) (ion:suffix))
accessibility (S (S (access:stem) (ible:suffix)) (ity:suffix))
accretion (S (accrete:stem) (ion:suffix))
activist (S (active:stem) (ist:suffix))
adaptable (S (adapt:stem) (able:suffix))
addiction (S (addict:stem) (ion:suffix))