Question

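I am working with a piece of proprietary software and have to carry out long, tedious analyses of the relationships between its components.

To be more productive at this task, I took a report generated by the software and tried to parse it, betting that Perl would be a good fit.

The report looks like this (after removing page numbers):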

Category one: NAME1    - Some free form text description goes here
    Used by Cat2 Resources:
        CAT2_NAME - Anoter free form text description here (there are lots of them, but they are pretty much useless, since no one cares about them, probably could not be that long.)
        CAT2_NAME2 - And so on.
        CAT2_NAME4 - U guessed it!
    Uses Resource Cat4:
        CAT4_NAMED   - A meaningless description that where copied from an unrelated resource (Save as...)


Category one: NAME7    - Description
    Used by Cat2 Resources:
        CAT2_NAME - Text
        CAT2_NAME5 - And so on.

        CAT2_NAME4 - U guessed it!
    Uses Resource Cat4:
        CAT4_NAME_  - Some names don't make any sense.

Category TWO: NAME7    - Description of another Category
    Used by Cat3 Resources:
        CAT3_NAME - Text
        CAT3_NAME5 - And so on.
        CAT3_NAME4 - U guessed it!
    Uses Resource Cat4:
        CAT4_NAME_  - Some names don't make any sense.

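To be perfectly clear:

All element names consist of uppercase letters, digits, and underscores ("_").
Almost all elements are related to one another.
There are orphan elements.
None of the names include their category (I added it in my example to make it more readable).
The whitespace at the start of each line under a paragraph/sub-paragraph consists of one or two TABs.
There are some random blank lines here and there. I intend to clean the file up a bit, but I am in a hurry right now.

I would like to be able to generate something like the following: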

Out_CAT1_CAT2.csv

NAME1,CAT2_NAME
NAME1,CAT2_NAME2
NAME1,CAT2_NAME4
NAME7,CAT2_NAME
NAME7,CAT2_NAME5
NAME7,CAT2_NAME4

Out_CAT1_CAT4.csv

NAME1,CAT4_NAMED
NAME7,CAT2_NAME4
NAME7,CAT4_NAME_

Out_CAT2_CAT3.csv

NAME7,CAT3_NAME
NAME7,CAT3_NAME5
NAME7,CAT3_NAME4

Out_CAT2_CAT4.csv

NAME7,CAT4_NAME_

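To parse this file, I tried (and failed with) a first approach, which consisted of grabbing a complete paragraph (one that starts with the category; again, the real label does not contain the word 'Category', only the category name, such as Database, Process Model, and so on).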

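Approach 1

I tried a multi-line regex such as /(<?=^Category one :)[A-Z0-9_]+.*+$(^\s.*$)+/m, intending to capture a complete paragraph into an array (or, preferably, one array per top-level category), but I tried many combinations at https://regex101.com/ without any luck.

My goal was to build an array of such Cat1 paragraphs, which I would then parse with a subroutine. But I failed (please give me some advice in the comments).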
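For reference, a few things in the pattern above look like likely culprits: `(<?=` is not lookbehind syntax (that would be `(?<=`), `Category one :` has a space before the colon that the report does not, and `$` followed by `^` never consumes the newline between lines. The following is a minimal sketch of one way to capture whole paragraphs, not a definitive fix. The sample text and the helper name `cat1_paragraphs` are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample in the same shape as the report (TAB indentation).
my $report = <<"END";
Category one: NAME1    - Some description
\tUsed by Cat2 Resources:
\t\tCAT2_NAME - Text

\t\tCAT2_NAME2 - And so on.
\tUses Resource Cat4:
\t\tCAT4_NAMED - Text

Category one: NAME7    - Description
\tUsed by Cat2 Resources:
\t\tCAT2_NAME5 - Text
END

# A paragraph is a header line followed by any number of lines that are
# either indented or blank. Each alternative consumes its own newline.
sub cat1_paragraphs {
    my ($text) = @_;
    my @paragraphs;
    while ( $text =~ /^Category one: ([A-Z0-9_]+)[^\n]*\n((?:[ \t]+[^\n]*\n|[ \t]*\n)*)/mg ) {
        push @paragraphs, { name => $1, body => $2 };
    }
    return @paragraphs;
}

for my $p ( cat1_paragraphs($report) ) {
    my $line_count = () = $p->{body} =~ /\n/g;
    print "$p->{name}: $line_count body lines\n";
}
```

Each `body` can then be handed to a subroutine that deals with the `Used by` and `Uses` sections.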

I then switched to a completely different approach and wrote something along the lines of:

Approach 2

while(<>){
    if(/^Category one: /){
        $mode = CAT1PARSING;
        # Used regex to grab the name as it came in this same line after the colon.
        $cat1Name = /regex/;
    }
    elsif(/^Category TWO: /){
        $mode = CAT2PARSING;
    }
    ...

    if($mode == CAT1PARSING){
        # Used some regex to capture the name and description as elements of an array
        push @cat1Array, ($el1, $el2) = $_ =~ (/regex/);
    }
    ...
}

# Here I do some formatting to dump the same information into many CSV files, one per category/subcategory pair, with the proper headers.

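My real program uses Approach 2, with two control variables, $mode and $subMode (luckily there are only two such levels), but I am not happy with it.

I am not sure whether this is what is called a 'state machine'. Can anyone confirm?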
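A line-by-line loop that remembers which block it is in is commonly described as a simple state machine. The sketch below shows the idea in runnable form, under stated assumptions: the sub name `parse_pairs` and the `"<category>_<subcategory>"` hash keys are invented for this example, and the regexes assume the report layout shown above.

```perl
use strict;
use warnings;

# The state is ($mode, $name, $sub_mode): the current category, the current
# element name, and the sub-section the following lines belong to.
sub parse_pairs {
    my (@lines) = @_;
    my ( $mode, $name, $sub_mode );
    my %pairs;    # "one_Cat2" => [ [ parent, child ], ... ]
    for my $line (@lines) {
        next if $line =~ /^\s*$/;    # ignore stray blank lines
        if ( $line =~ /^Category (\w+): ([A-Z0-9_]+)/ ) {
            ( $mode, $name, $sub_mode ) = ( $1, $2, undef );
        }
        elsif ( $line =~ /^\s+(?:Used by (\w+) Resources|Uses Resource (\w+)):/ ) {
            $sub_mode = $1 // $2;
        }
        elsif ( defined $sub_mode and $line =~ /^\s+([A-Z0-9_]+)\s*-/ ) {
            push @{ $pairs{"${mode}_$sub_mode"} }, [ $name, $1 ];
        }
    }
    return \%pairs;
}

my $pairs = parse_pairs(
    "Category one: NAME1    - Description",
    "\tUsed by Cat2 Resources:",
    "\t\tCAT2_NAME - Text",
    "\tUses Resource Cat4:",
    "\t\tCAT4_NAMED - Text",
);
for my $key ( sort keys %$pairs ) {
    print "$key: ", join( ' ', map { join ',', @$_ } @{ $pairs->{$key} } ), "\n";
}
```

Keeping the state in a few scalars works here because there are only two nesting levels; deeper nesting would probably call for a stack.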

So I am clearly not asking just one question, but my main question is:

Is there any way to achieve this with a regex, as described in Approach 1? How?

Answer 1

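Rule 1 of parsing 'ad-hoc' formats is: find the field delimiters.

In this case, you have "Category" at the start of a line. If you had a proper 'blank line' between records you could use paragraph mode, but you don't, so: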

local $/ = "\nCategory";

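That splits the input at every line that starts with "Category", whether or not a blank line comes before it.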
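To see what that record separator does in isolation, here is a small sketch that reads from an in-memory filehandle. The helper name `read_records` and the sample text are assumptions made for illustration.

```perl
use strict;
use warnings;

sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "\nCategory";    # each read now stops after "\nCategory"
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;          # chomp removes a trailing $/, here "\nCategory"
        push @records, $record;
    }
    close $fh;
    return @records;
}

my @records = read_records(
    "Category one: NAME1 - x\n\tCAT2_NAME - y\n\nCategory one: NAME7 - z\n\tCAT2_NAME5 - w\n"
);
print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Note that the separator is consumed as part of the record it ends, so every record after the first begins with " one: ..." rather than "Category one: ...". Any regex that extracts the category name has to allow for that.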

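Rule two: look for structure within the record. It looks like you have 'Used by' and 'Uses' sections and, within those, key-value pairs that are indented and separated by `-`. Do you need to preserve the order of these? If not, a hash makes sense; if so, you need an array.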
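The difference between the two containers can be sketched as follows, using the same line pattern on a few hypothetical element lines. The variable names `%by_name` and `@in_order` are made up for this example.

```perl
use strict;
use warnings;

my @lines = (
    "\t\tCAT2_NAME - Text",
    "\t\tCAT2_NAME5 - And so on.",
    "\t\tCAT2_NAME4 - U guessed it!",
);

my ( %by_name, @in_order );
for my $line (@lines) {
    next unless $line =~ /^\s+(\w+)\s*-\s*(.*)$/;
    $by_name{$1} = $2;              # hash: lookup by name, order not kept
    push @in_order, [ $1, $2 ];     # array of pairs: report order kept
}

print "Lookup: $by_name{CAT2_NAME5}\n";
print "Order:  ", join( ', ', map { $_->[0] } @in_order ), "\n";
```

A hash also silently collapses repeated names within a section, which may or may not be what you want.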

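Then the third question: what does the output need to look like? You mention CSV, but if your data is hierarchical (as it seems to be), something like JSON might serve you better.

Anyway, the parsing looks something like this: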

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use JSON;

my @records;

local $/ = "\nCategory";
while (<>) {

   print "Record $. looks like: \n\n";
   print;
   print "\n\n----\n\n";


   # Indentation may be spaces or TABs, so match any leading whitespace.
   my ( $used_cat, $used_text, $uses_cat, $uses_text ) =
     m/^\s+Used by (\w+) Resources:(.*)Uses Resource (\w+):(.*)/ms;


   print "::\n";
   print $used_text;
   print "\n:::\n";

   my $this_record;
   $this_record->{used_cat} = $used_cat;
   $this_record->{uses_cat} = $uses_cat;

   for ( split /\n/, $used_text ) {

      if ( my ( $key, $value ) = m/^\s+(\w+)\s*-\s*(.*)$/ ) {
         print "$key => $value\n";
         $this_record->{used}{$key} = $value;
      }
   }

   for ( split /\n/, $uses_text ) {

      if ( my ( $key, $value ) = m/^\s+(\w+)\s*-\s*(.*)$/ ) {
         print "$key => $value\n";
         $this_record->{uses}{$key} = $value;
      }
   }

   push @records, $this_record;
}

print Dumper \@records;
print "JSON Output:\n";
print to_json ( \@records, { pretty => 1 } );
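If CSV is still the goal, records like these could be regrouped into one list of rows per output file. The sketch below is only an illustration under loud assumptions: the `cat` and `name` fields are hypothetical (the parsing code above does not capture them), the sample records are made up, and the helper `csv_rows_by_file` is an invented name. It builds the rows in memory and prints them rather than writing files.

```perl
use strict;
use warnings;

# Hypothetical records; 'cat' and 'name' are assumed extra fields.
my @records = (
    {   cat      => 'CAT1', name => 'NAME1',
        used_cat => 'CAT2', used => { CAT2_NAME => 'Text', CAT2_NAME2 => 'And so on.' },
        uses_cat => 'CAT4', uses => { CAT4_NAMED => 'Text' },
    },
    {   cat      => 'CAT1', name => 'NAME7',
        used_cat => 'CAT2', used => { CAT2_NAME5 => 'And so on.' },
        uses_cat => 'CAT4', uses => { CAT4_NAME_ => 'Text' },
    },
);

# Group "parent,child" rows under the file name they belong to.
sub csv_rows_by_file {
    my (@recs) = @_;
    my %rows;
    for my $r (@recs) {
        for my $section (qw(used uses)) {
            my $file = "Out_$r->{cat}_" . $r->{"${section}_cat"} . ".csv";
            push @{ $rows{$file} }, "$r->{name},$_"
              for sort keys %{ $r->{$section} || {} };
        }
    }
    return \%rows;
}

my $rows = csv_rows_by_file(@records);
for my $file ( sort keys %$rows ) {
    print "$file\n";
    print "$_\n" for @{ $rows->{$file} };
    print "\n";
}
```

Joining with a plain comma is enough here because the names contain only uppercase letters, digits, and underscores; free-form fields such as the descriptions would be safer with a CSV module like Text::CSV.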
