Question

I am trying to parse a custom input file for a simulation code I am writing. It consists of nested "objects" with properties and values (see link).

Here is an example file and the regex I am currently using:

([^:#\n]*):?([^#\n]*)#?.*\n

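Each match is one line with two capture groups, one for the property and one for its value. The character classes also exclude "#" and ":", because these are the comment delimiter and the property:value separator, respectively.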
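As a quick check of what the two capture groups contain, here is a minimal sketch that runs the pattern above over a few sample lines. It uses Python's `re` module rather than Octave (an assumption made for illustration; the pattern only uses syntax common to PCRE and `re`), and the `sample` string is a hypothetical excerpt in the style of the input file.

```python
import re

# The pattern from the question, unchanged.
LINE_RE = re.compile(r"([^:#\n]*):?([^#\n]*)#?.*\n")

# Hypothetical excerpt in the style of the input file.
sample = (
    "case    : \n"
    "    name    : tandem solar cell\n"
    "    options :\n"
    "        verbose : true # a sample comment\n"
)

# One (property, value) pair per line; whitespace is kept as captured.
pairs = [(m.group(1), m.group(2)) for m in LINE_RE.finditer(sample)]
for prop, value in pairs:
    print(repr(prop), repr(value))
```

Note that the first group keeps the leading whitespace of each line, so the indentation is still available after matching, but the pattern itself says nothing about which line is the parent of which.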

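How can I modify my regex so that it matches the structure recursively? That is, if line n+1 has a higher indentation level than line n, it should be matched as a subgroup of line n's match.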

I am using Octave, which uses the PCRE regex format.

Answer 1

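I asked whether you have control over the data format because this data would be very easy to parse with YAML instead of a regex.

The only problem is that the objects are malformed:

1) Take the regions object as an example: it has many properties named layer. I think your intention is to build a list of layers, not many properties with the same name.

2) Now consider each layer property with its corresponding value. Each layer is followed by orphaned properties that I believe belong to that layer.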
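To illustrate point 1, here is a small sketch of what typically happens to repeated keys in a mapping. It uses Python's standard `json` module purely as a stand-in (an assumption for illustration, not the YAML parser discussed here), and both documents are hypothetical.

```python
import json

# Hypothetical document with a repeated key, mirroring the repeated `layer`.
doc = '{"layer": "Glass", "layer": "FTO"}'

# json.loads keeps only the last value for a repeated name.
parsed = json.loads(doc)
print(parsed)

# Written as a list, every layer survives.
doc_list = '{"regions": [{"layer": "Glass"}, {"layer": "FTO"}]}'
regions = json.loads(doc_list)["regions"]
print([r["layer"] for r in regions])
```

This is why a list of layers is a better fit than several properties that share one name.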

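With these points in mind: if the objects are formed according to YAML rules, parsing them is a breeze.

I know you are using Octave, but consider the modifications I made to your data and how easy it then is to parse, in this case using Python.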

The data you have now

case    : 
    name    : tandem solar cell
    options :
        verbose : true
        t_stamp : system
    units   :
        energy  : eV
        length  : nm
        time    : s
        tension : V
        temperature: K
        mqty    : mole
        light   : cd
    regions :
        layer   : Glass
            geometry:
                thick   : 80 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt
        layer   : FTO
            geometry:
                thick   : 10 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt

The data modified to conform to YAML syntax

case    : 
    name    : tandem solar cell
    options :
        verbose : true
        t_stamp : system # a sample comment
    units   :
        energy  : eV
        length  : nm
        time    : s
        tension : V
        temperature: K
        mqty    : mole
        light   : cd
    regions : 
        -   layer   : Glass # ADDED THE - TO MAKE IT A LIST OF LAYERS
            geometry :      # AND KEEP INDENTATION PROPERLY
                thick   : 80 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt
        -   layer   : FTO
            geometry:
                thick   : 10 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt

These instructions are all you need to parse your object:

import yaml
# `text` holds the YAML document above; safe_load needs no explicit Loader
data = yaml.safe_load(text)

""" your data would be parsed as:
{'case': {'name': 'tandem solar cell',
          'options': {'t_stamp': 'system', 'verbose': True},
          'regions': [{'geometry': {'npoints': 10, 'thick': '80 nm'},
                       'layer': 'Glass',
                       'optical': {'nk_file': 'vacuum.txt'}},
                      {'geometry': {'npoints': 10, 'thick': '10 nm'},
                       'layer': 'FTO',
                       'optical': {'nk_file': 'vacuum.txt'}}],
          'units': {'energy': 'eV',
                    'length': 'nm',
                    'light': 'cd',
                    'mqty': 'mole',
                    'temperature': 'K',
                    'tension': 'V',
                    'time': 's'}}}

"""
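If the file format cannot be changed, one alternative is to keep a per-line regex and track nesting with a stack of indentation levels. That avoids asking the regex itself to recurse. The following is only a sketch under stated assumptions: the names `LINE_RE` and `parse_indented` are hypothetical, the pattern is a variant of the one in the question that also captures leading whitespace, and each line becomes a node with `key`, `value`, and `children` so that repeated keys such as `layer` are preserved.

```python
import re

# Variant of the question's pattern: indentation, key, value, optional comment.
LINE_RE = re.compile(r"^(\s*)([^:#\n]+?)\s*:\s*([^#\n]*?)\s*(?:#.*)?$")


def parse_indented(text):
    """Return the top-level nodes; each node is {'key', 'value', 'children'}."""
    root = {"key": None, "value": None, "children": []}
    stack = [(-1, root)]  # (indent width, node); root sits below any real indent
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # blank or comment-only line
        indent = len(m.group(1).expandtabs(4))
        node = {"key": m.group(2), "value": m.group(3), "children": []}
        # Pop until the top of the stack is shallower than this line:
        # that entry is the parent of the current line.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
    return root["children"]


# Hypothetical excerpt in the original (unmodified) format.
sample = """case    :
    name    : tandem solar cell
    regions :
        layer   : Glass
            geometry:
                thick   : 80 nm
        layer   : FTO
            geometry:
                thick   : 10 nm
"""

tree = parse_indented(sample)
layers = tree[0]["children"][1]["children"]
for layer in layers:
    geometry = layer["children"][0]
    print(layer["value"], geometry["children"][0]["value"])
```

Because children are stored in a list rather than a dictionary, both `layer` entries survive, and the properties indented under each one are attached to it.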
