Question

I'm working with a Spanish dictionary whose definitions look like this:

l. a. c. Buitre, alimoche. adj. Persona alelada. (Cornago). GOICOECHEA. // 2. f. Persona torpe, despistada e irreflexiva. // 3. Estar mirando a los abantos. fr. fig. Ser despistado, soñador, no apercibirse de la realidad. Autol. RUIZ. // 4. f. esto es una prueba

The following rules apply:

Each definition may contain one (and never more than one) of the following categories:
- l. a. c.
- f.
- m.
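The category is always at the start of the definition.
The first definition starts at the beginning of the entry; each further definition starts with // n., where 'n' is a number (possibly more than one digit).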

For the example I gave, the following definitions should be parsed:

(category: l. a. c.) Buitre, alimoche. adj. Persona alelada. (Cornago). GOICOECHEA.
(category: f.) Persona torpe, despistada e irreflexiva.
(no category) Estar mirando a los abantos. fr. fig. Ser despistado, soñador, no apercibirse de la realidad. Autol. RUIZ.
(category: f.) esto es una prueba

I'm trying to build a regex that captures each definition (i.e. zero or one category plus the meaning). This is what I have:

(?:(m\.|l\. a\. c\.|f\.) )?(.*?) (?:$|(?:\/\/ \d+. (?:(m\.|l\. a\. c\.|f\.) )?(.*?))+)

I'm testing it here, and this is how I wrote it:

(?:
    (m\.|l\. a\. c\.|f\.)  <-- First: unnamed group containing the named group 
                                      for the category  and one space
)?
(.*?)                      <-- Named group for the meaning
(?:                        <-- Unnamed group for end of line OR another definition
   $                       <--- (end of line)
   |                       <--- (OR)
   (?:\/\/ \d+.            <--- (Definition separator & number)
       (?:(m\.|l\. a\. c\.|f\.) )?(.*?) <-- Another definition
   )+                                   <-- There may be more than one definition, so we add '+'
)

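I have a few questions:

I'm not sure why it doesn't work. It seems that the last capture group (.*?) doesn't extend up to the next //. How can I fix that?
The group (m\.|l\. a\. c\.|f\.) should be larger (there are more categories). How can I avoid repeating it?
There is some repetition in the regex string I gave. How can I avoid it?

This is my first non-trivial regex, so any other comments on style or general improvements are welcome.

My main question is why my regex doesn't work. (This is just to clarify...)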

Answer 1

The problem is that the last capture group is non-greedy.

(?:
    (m\.|l\. a\. c\.|f\.)
)?
(.*?)
(?:
   $
   |
   (?:\/\/ \d+.
       (?:(m\.|l\. a\. c\.|f\.) )?
       (.*?) <-- this is non-greedy.
   )
)+

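Because it is lazy and nothing after it forces it to consume more, it matches only the empty string. The + at the end of the pattern doesn't help: one repetition is already enough for the overall match to succeed, so the engine stops there.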
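As a quick check of this explanation, here is a minimal sketch that runs the question's original pattern against the sample entry. It assumes Python's re module, which is my choice for illustration (the thread doesn't say which engine the asker uses), and the variable names are made up.

```python
import re

# Sample entry from the question.
entry = (
    "l. a. c. Buitre, alimoche. adj. Persona alelada. (Cornago). GOICOECHEA. "
    "// 2. f. Persona torpe, despistada e irreflexiva. "
    "// 3. Estar mirando a los abantos. fr. fig. Ser despistado, soñador, "
    "no apercibirse de la realidad. Autol. RUIZ. "
    "// 4. f. esto es una prueba"
)

# The asker's original pattern, unchanged.
original = re.compile(
    r"(?:(m\.|l\. a\. c\.|f\.) )?(.*?) "
    r"(?:$|(?:\/\/ \d+. (?:(m\.|l\. a\. c\.|f\.) )?(.*?))+)"
)

m = original.match(entry)
print(repr(m.group(2)))         # meaning of the first definition
print(repr(m.group(3)))         # category of the second definition
print(repr(m.group(4)))         # the lazy trailing group: empty
print(repr(entry[m.end():]))    # everything the match never reached
```

The match ends right after "// 2. f. ", so the text of the second definition and everything after it is left unmatched.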

The fix is simple: force the pattern to match the whole line. Just add $ at the end.

(?:(m\.|l\. a\. c\.|f\.) )?(.*?) (?:$|(?:\/\/ \d+. (?:(m\.|l\. a\. c\.|f\.) )?(.*?)))+$

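Edit: Capturing every category and definition with a single match is not possible in most regex flavors. If you match the whole string with one pattern, each capture group keeps only the text from its last repetition, so from the repeated part you can only recover the last definition.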
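To illustrate that limitation, the sketch below runs the anchored pattern (the one ending in $) on the sample entry. Again this assumes Python's re module and hypothetical variable names; other engines may expose repeated captures differently.

```python
import re

# Sample entry from the question.
entry = (
    "l. a. c. Buitre, alimoche. adj. Persona alelada. (Cornago). GOICOECHEA. "
    "// 2. f. Persona torpe, despistada e irreflexiva. "
    "// 3. Estar mirando a los abantos. fr. fig. Ser despistado, soñador, "
    "no apercibirse de la realidad. Autol. RUIZ. "
    "// 4. f. esto es una prueba"
)

# The pattern with the trailing $ that forces a whole-line match.
anchored = re.compile(
    r"(?:(m\.|l\. a\. c\.|f\.) )?(.*?) "
    r"(?:$|(?:\/\/ \d+. (?:(m\.|l\. a\. c\.|f\.) )?(.*?)))+$"
)

m = anchored.match(entry)
print(m.group(0) == entry)   # the whole line is matched now
print(m.groups())            # only four slots, however many definitions exist
```

Groups 3 and 4 hold the category and text of definition 4 only; definitions 2 and 3 were matched but their captures were overwritten.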

You can use this pattern to match a single definition.

(?:^| \/\/ \d+\. )(?:(?P<category>m\.|l\. a\. c\.|f\.) )?(?P<definition>.*?)(?:$|(?= \/\/ \d+\.))

Apply it to the string repeatedly, until no more matches are found, to capture all the definitions.

while (matcher.find()) {
    // ... do something with the current match
}
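For comparison, here is a rough Python equivalent of that loop, using re.finditer with the per-definition pattern. It is a sketch under assumptions: Python is my choice (its re module accepts the (?P<name>...) syntax used above), the variable names are hypothetical, and I write \d+ for the number because the question says it may have more than one digit.

```python
import re

# Sample entry from the question.
entry = (
    "l. a. c. Buitre, alimoche. adj. Persona alelada. (Cornago). GOICOECHEA. "
    "// 2. f. Persona torpe, despistada e irreflexiva. "
    "// 3. Estar mirando a los abantos. fr. fig. Ser despistado, soñador, "
    "no apercibirse de la realidad. Autol. RUIZ. "
    "// 4. f. esto es una prueba"
)

# One match per definition; \d+ allows multi-digit numbers (my adjustment).
single = re.compile(
    r"(?:^| \/\/ \d+\. )"
    r"(?:(?P<category>m\.|l\. a\. c\.|f\.) )?"
    r"(?P<definition>.*?)"
    r"(?:$|(?= \/\/ \d+\.))"
)

# category is None when a definition has no category.
definitions = [
    (m.group("category"), m.group("definition"))
    for m in single.finditer(entry)
]

for category, text in definitions:
    print(category, "->", text)
```

Each iteration yields exactly one definition, so nothing gets overwritten the way it does with a single whole-line match.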

Demo.

Detailed explanation of the pattern:

(?:
    ^ // match start of string
| // OR
     \/\/ \d+\.  // " // " literally, followed by one or more digits, a dot, and a space
)
(?:
    (?P<category> // in the named group "category", capture...
        m\.|l\. a\. c\.|f\. // one of "m.", "l. a. c.", "f."
    )  // and a space
)? // ...if possible.
(?P<definition> // in the named group "definition", capture...
    .*? // everything up to...
)
(?:
    $ // the end of the string
| // OR
    (?= // the start of the next definition. This needs to be enclosed in a lookahead assertion so as not to consume it.
         \/\/ \d+\.
    ) 
)
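On the question's other two points (a longer category list, and repetition inside the regex), one possible approach is to generate the alternation from a list and keep the pattern in commented, free-spacing form. The sketch below assumes Python's re.VERBOSE flag; CATEGORIES, build_pattern and parse_entry are hypothetical names, not something from the thread.

```python
import re

# Hypothetical category list; extend it without touching the pattern.
CATEGORIES = ["l. a. c.", "f.", "m."]


def build_pattern(categories):
    # Longest labels first, so a short label cannot shadow a longer one
    # that starts with the same text. re.escape also escapes the spaces,
    # which keeps them literal under re.VERBOSE.
    ordered = sorted(categories, key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in ordered)
    return re.compile(
        rf"""
        (?: ^ | [ ]//[ ]\d+\.[ ] )                # entry start, or the separator and number
        (?: (?P<category> {alternation} ) [ ] )?  # optional category plus one space
        (?P<definition> .*? )                     # lazily, everything up to...
        (?: $ | (?= [ ]//[ ]\d+\. ) )             # ...the end or the next separator
        """,
        re.VERBOSE,
    )


def parse_entry(entry, pattern):
    # Returns (category, definition) pairs; category is None when absent.
    return [
        (m.group("category"), m.group("definition"))
        for m in pattern.finditer(entry)
    ]


pattern = build_pattern(CATEGORIES)
print(parse_entry("m. uno // 10. f. dos // 11. tres", pattern))
```

Because the category list appears in only one place, adding a category means editing CATEGORIES rather than every copy of the alternation.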

How to fix this regex (matching dictionary entries)
