Question

To keep it simple, this is what I am trying to accomplish.

Original:

[category - subcategory] [some text - more text] [2018-12-31] text title here

Desired result:

category
subcategory
some text
more text
2018-12-31
text title here

The number of bracket groups is always the same, but the number of attributes inside each pair of brackets can vary:

[category - subcategory] [some text - more text] [2018-12-31] text title here

[category - subcategory] [some text] [2018-12-31] text title here more text

[category] [some text - more text - even more] [2018-12-31] text title here more text

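So the text inside the first two [] [] groups is separated by -.

Yesterday was my first attempt at using regexp, and it was a bit of a headache. Is what I want to do possible?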

Answer 1

I would solve this in two steps.

First, use this regular expression to extract the chunks between and after the square brackets:

\[(.*?)\]\s*\[(.*?)\]\s*\[(.*?)\]\s*(.*)

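Assuming square brackets are not allowed anywhere else in the input, this gives you four capture groups: the categories, the text, the date, and the free text.

Details:

\[ and \] match literal square brackets.
(.*?) matches the text between the brackets non-greedily, which avoids the clumsier character class ([^][]*) to exclude them.
\s* allows any amount of whitespace between the chunks. If the pattern is always exactly one space, you can use a single space instead.
(.*) at the end simply grabs everything that remains on the line.

You can then split the categories and the text on '-' into an array or list, trimming the surrounding whitespace, to get the subdivisions you want. Since you want to capture a variable number of fields in the first two bracket groups, trying to capture everything in one big regular expression seems harder than necessary when split() does the job easily.

PS: Since you did not specify a programming language, I have given descriptive pseudo-code; you will have to look up how to access match groups and split strings in your language.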
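
For illustration only, here is how the two steps might look in Python. The function name parse_line is made up for this sketch, and splitting on " - " (with the surrounding spaces) rather than a bare "-" is an assumption based on the sample lines, so that hyphenated words are left alone.

```python
import re

# Three bracketed chunks followed by free text, same pattern as above.
PATTERN = re.compile(r"\[(.*?)\]\s*\[(.*?)\]\s*\[(.*?)\]\s*(.*)")


def parse_line(line):
    """Return the flat list of fields, or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    categories, texts, date, title = m.groups()
    fields = []
    # Only the first two chunks are split; splitting the date would break it apart.
    for chunk in (categories, texts):
        fields.extend(part.strip() for part in chunk.split(" - "))
    fields.append(date.strip())
    fields.append(title.strip())
    return fields


if __name__ == "__main__":
    sample = "[category - subcategory] [some text - more text] [2018-12-31] text title here"
    print("\n".join(parse_line(sample)))
```

Keeping the split outside the regular expression means the same code handles one, two, or three attributes per bracket without any change to the pattern.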


Answer 2

Yes, it is possible, but the expression can get somewhat complicated, something like:

\[\s*(\s*\d{4}\s*-\s*\d{2}\s*-\s*\d{2}\s*)\s*\]|(?<=\[|-)\s*(.*?)\s*(?=-|\])|([A-Za-z].*)

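We first capture the date with:

\[\s*(\s*\d{4}\s*-\s*\d{2}\s*-\s*\d{2}\s*)\s*\]

then capture the other desired substrings inside the remaining brackets with:

(?<=\[|-)\s*(.*?)\s*(?=-|\])

and finally the trailing text with:

([A-Za-z].*)

If needed, we can add other characters to the character class:

[A-Za-z]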

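If you want to explore, simplify, or modify the expression, it is explained in the top-right panel of this demo.

Demo

In this demo, you can see how the capture groups work: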

const regex = /\[\s*(\s*\d{4}\s*-\s*\d{2}\s*-\s*\d{2}\s*)\s*\]|(?<=\[|-)\s*(.*?)\s*(?=-|\])|([A-Za-z].*)/gm;
const str = `[category - subcategory] [some text   -   more text  ] [2018-12-31] text title here
[category - subcategory] [some text] [  2018 - 12 -31  ] text title here more text
[category] [some text - more text - even more] [2018-12-31] text title here more text
[category] [some text - more text - even more - some text - more text   -   even more  ] [2018-12-31] text title here more text`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
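
As a rough Python counterpart to the snippet above (not part of the original demo), the same combined expression can be run with re.finditer, keeping whichever of the three groups took part in each match. The name extract_fields is hypothetical. Note that this pattern splits on any hyphen inside the brackets, with or without spaces around it.

```python
import re

COMBINED = re.compile(
    r"\[\s*(\s*\d{4}\s*-\s*\d{2}\s*-\s*\d{2}\s*)\s*\]"  # group 1: the date
    r"|(?<=\[|-)\s*(.*?)\s*(?=-|\])"                     # group 2: one attribute
    r"|([A-Za-z].*)"                                     # group 3: trailing text
)


def extract_fields(line):
    """Collect the captured value of every match, in order of appearance."""
    fields = []
    for m in COMBINED.finditer(line):
        # Each alternative has its own group, so exactly one group is not None.
        value = next(g for g in m.groups() if g is not None)
        fields.append(value.strip())
    return fields


if __name__ == "__main__":
    line = "[category] [some text - more text - even more] [2018-12-31] text title here more text"
    print("\n".join(extract_fields(line)))
```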

Answer 3

You can also use sed to get the result in the desired format:

echo '[category - subcategory] [some text - more text] [2018-12-31] text title here' \
| sed -e $'s/\] /\\\n/g' -e $'s/ \- /\\\n/g' -e 's/\[//g'

Output:

category
subcategory
some text
more text
2018-12-31
text title here

It first converts "] " (closing bracket followed by a space) and " - " (space, hyphen, space) into newlines, then replaces [ with nothing.
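
If sed is not available, the same three substitutions can be sketched in Python with plain string replacement. The helper name brackets_to_lines is invented for this example. As with the sed version, a " - " in the trailing title would also be turned into a line break.

```python
def brackets_to_lines(line):
    """Apply the three substitutions in the same order as the sed command."""
    # "] " closes a bracket group, so it becomes a line break.
    out = line.replace("] ", "\n")
    # " - " separates attributes inside a group.
    out = out.replace(" - ", "\n")
    # Drop the opening brackets that are left over.
    out = out.replace("[", "")
    return out.split("\n")


if __name__ == "__main__":
    sample = "[category - subcategory] [some text - more text] [2018-12-31] text title here"
    print("\n".join(brackets_to_lines(sample)))
```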

Answer 4

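Try the pattern \[.+?(?(?<= - ) - |\])

Explanation:

\[ - matches [ literally

.+? - matches one or more of any character (non-greedy)

(?(?<= - ) - |\]) - conditional: if the positive lookbehind (?<= - ) succeeds (it matches " - " literally), then match " - ", otherwise match ] with \]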


Answer 5

Do yourself a favor and write your own parser. For example, in Python (no language tagged yet?), this could be done with parsimonious:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

data = ["[category - subcategory] [some text - more text] [2018-12-31] text title here",
        "[category - subcategory] [some text] [2018-12-31] text title here more text",
        "[category] [some text - more text - even more] [2018-12-31] text title here more text",
        "[category - subcategory] [some text - more text] [2018-12-31] text title here"]


class TextVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        content = (section / text)+

        section = lpar notpar (sep notpar)* rpar ws*
        text    = ~"[^][]+"

        lpar    = "["
        rpar    = "]"
        notpar  = ~"(?:(?! - )[^][])+"
        sep     = " - "
        ws      = ~"\s+"
        """
    )

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_section(self, node, visited_children):
        _, cat1, catn, *_ = visited_children

        categories = [cat1.text] + [cat[1].text for cat in catn]
        return categories

    def visit_text(self, node, visited_children):
        return [node.text]

    def visit_content(self, node, visited_children):
        result = [textnode
                  for child in visited_children
                  for subchild in child
                  for textnode in subchild]
        return result


for datapoint in data:
    tv = TextVisitor()
    result = tv.parse(datapoint)
    print("\n".join(result))
    print("###")

This yields:

category
subcategory
some text
more text
2018-12-31
text title here
###
category
subcategory
some text
2018-12-31
text title here more text
###
category
some text
more text
even more
2018-12-31
text title here more text
###
category
subcategory
some text
more text
2018-12-31
text title here
###

Answer 6

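If the \G anchor is supported, which asserts the position at the end of the previous match, you can get the separate parts inside the square brackets, without the hyphens, using: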

(?:\[|\G(?!^))([^-\][\s]+(?:[ -][^-\][\s]+)*)(?: - )?(?=[^[\]]*\])

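The matches are in the first capturing group.

Explanation

(?: Non-capturing group
- \[ Match [
- | Or
- \G(?!^) Assert the position at the end of the previous match, not at the start of the string
) Close non-capturing group
( Capture group 1
- [^-\][\s]+ Match 1+ times any character except -, ], [ or a whitespace character
- (?:[ -][^-\][\s]+)* Repeat 0+ times the same pattern as before, preceded by a space or a hyphen
) Close group
(?: - )? Optionally match - between spaces
(?= Positive lookahead, assert that what is on the right is
- [^[\]]*\] 0+ times any character except [ and ], followed by ]
) Close lookahead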


Extract text inside square brackets (with separators between the attributes inside the brackets)
