Question

I have a CSS file generated by a tool, and it is formatted like this:

@font-face {
    font-family: 'icomoon';
    src:url('fonts/icomoon.eot?4px1bm');
    src:url('fonts/icomoon.eot?#iefix4px1bm') format('embedded-opentype'),
        url('fonts/icomoon.woff?4px1bm') format('woff'),
        url('fonts/icomoon.ttf?4px1bm') format('truetype'),
        url('fonts/icomoon.svg?4px1bm#icomoon') format('svg');
    font-weight: normal;
    font-style: normal;
}

[class^="icon-"], [class*=" icon-"] {
    font-family: 'icomoon';
    speak: none;
    font-style: normal;
    font-weight: normal;
    font-variant: normal;
    text-transform: none;
    line-height: 1;

    /* Better Font Rendering =========== */
    -webkit-font-smoothing: antialiased;
    -moz-osx-font-smoothing: grayscale;
}

.icon-pya:before {
    content: "\e60d";
}
.icon-pyp:before {
    content: "\e60b";
}
.icon-tomb:before {
    content: "\e600";
}
.icon-right:before {
    content: "\e601";
}

I want to use a regex in Python to extract every CSS selector that starts with .icon- together with its associated value, for example:

{key: '.icon-right:before', value: 'content: "\e601";'}

I only have basic regex knowledge, so I wrote this: \^.icon.*\, but it only matches the key, not the value.

Answer 1

If you're using Python, this regex will work:

(\.icon-[^\{]*?)\s*\{\s*([^\}]*?)\s*\}

Example:

>>> css = r"""
... /* ... etc ... */
... .icon-right:before {
...     content: "\e601";
... }
... """
>>> import re
>>> pattern = re.compile(r"(\.icon-[^\{]*?)\s*\{\s*([^\}]*?)\s*\}")
>>> re.findall(pattern, css)
[
    ('.icon-pya:before', 'content: "\\e60d";'),
    ('.icon-pyp:before', 'content: "\\e60b";'),
    ('.icon-tomb:before', 'content: "\\e600";'),
    ('.icon-right:before', 'content: "\\e601";')
]

You can then easily turn that into a dictionary:

>>> dict(re.findall(pattern, css))
{
    '.icon-right:before': 'content: "\\e601";',
    '.icon-pya:before': 'content: "\\e60d";',
    '.icon-tomb:before': 'content: "\\e600";',
    '.icon-pyp:before': 'content: "\\e60b";'
}

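This is usually a more sensible data structure than a sequence of {'key': ..., 'value': ...} dicts. If you really must have the latter, I'll assume you know enough Python to work out how to get there.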
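For anyone who does want the {'key': ..., 'value': ...} shape from the question, here is a minimal sketch. It assumes the same pattern as above; the names css and rules are just illustrative, and the sample CSS is a shortened stand-in for the real file.

```python
import re

# Shortened stand-in for the generated CSS file (raw string keeps "\e601" as-is).
css = r"""
.icon-tomb:before {
    content: "\e600";
}
.icon-right:before {
    content: "\e601";
}
"""

pattern = re.compile(r"(\.icon-[^\{]*?)\s*\{\s*([^\}]*?)\s*\}")

# findall returns (selector, declarations) tuples; reshape each into a dict.
rules = [{"key": selector, "value": declarations}
         for selector, declarations in pattern.findall(css)]

for rule in rules:
    print(rule)
```

Unlike the plain dict, this list keeps the rules in file order and would also keep duplicates if a selector appeared more than once.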

OK, that's a fairly complex regex, so let's take it apart piece by piece:

(\.icon-[^\{]*?)

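This is the first capture group, delimited by parentheses. Inside it, we match a literal .icon- (\.icon-), followed by [^\{]*?, which is a sequence of zero or more (*), but as few as possible (?), of any character except '{' ([^\{]).

Then there is a non-captured part: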

\s*\{\s*

This means any amount of whitespace (\s*), then a '{' (\{), followed by any amount of whitespace (\s*).

Next comes our second capture group, again enclosed in parentheses:

([^\}]*?)

...which is zero or more (*), but as few as possible (?), of any character except '}' ([^\}]).

Finally, the last non-captured part:

\s*\}

...which is any amount of whitespace (\s*), followed by a '}' (\}).
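The same breakdown can be written straight into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of the regex above laid out over several lines; the sample CSS and the variable names are illustrative.

```python
import re

pattern = re.compile(r"""
    (\.icon-[^\{]*?)   # group 1: the selector, from .icon- up to the brace
    \s*\{\s*           # whitespace, the opening brace, whitespace (not captured)
    ([^\}]*?)          # group 2: the declarations, anything but a closing brace
    \s*\}              # whitespace, then the closing brace (not captured)
""", re.VERBOSE)

css = r"""
.icon-pya:before {
    content: "\e60d";
}
.icon-right:before {
    content: "\e601";
}
"""

matches = pattern.findall(css)
print(matches)
```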

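In case you're wondering, the reason for using *? (zero or more, but as few as possible, known as non-greedy matching) is so that the \s* that follows (any amount of whitespace) can consume as much whitespace as possible, and that whitespace doesn't end up inside the captured group.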
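A quick way to see the difference is to run both variants on a selector with some trailing spaces. This is a small sketch; the text string and the lazy/greedy names are made up for the comparison.

```python
import re

# A selector followed by three spaces before the opening brace.
text = ".icon-right:before   {"

# Non-greedy: the group stops as early as possible, so \s* takes the spaces.
lazy = re.search(r"(\.icon-[^\{]*?)\s*\{", text).group(1)

# Greedy: the group swallows everything up to the brace, spaces included.
greedy = re.search(r"(\.icon-[^\{]*)\s*\{", text).group(1)

print(repr(lazy))
print(repr(greedy))
```

With the greedy version you would have to strip the captured text afterwards.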

Answer 2

Based on your current content, this regex will work:

(\.icon-[^\s{]+)\s*{\s*([^;]*;)

See the demo (look at the substitutions at the bottom).

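The selector name is captured in group 1, and the rule in group 2.

To produce output in the format you specified, you have a few options.

For example, tweak the regex slightly and replace with: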

{key: '\1', value: '\2' }

This assumes there is only one rule per set of braces.
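Here is a rough sketch of that substitution in Python using re.sub. The answer doesn't spell out the tweak, so this assumes it means also consuming the closing brace (the trailing \s*\} is my addition, and I've escaped the opening brace for readability). The sample CSS and variable names are illustrative.

```python
import re

css = r"""
.icon-tomb:before {
    content: "\e600";
}
.icon-right:before {
    content: "\e601";
}
"""

# Assumed tweak: match through the closing brace so it is replaced as well.
pattern = re.compile(r"(\.icon-[^\s{]+)\s*\{\s*([^;]*;)\s*\}")

# \1 and \2 in the replacement refer to the selector and the rule.
result = pattern.sub(r"{key: '\1', value: '\2' }", css)
print(result)
```

If a block held more than one declaration, the added \s*\} would fail to match and that block would be left untouched, which is the one-rule-per-block assumption in practice.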

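A better option is to find all the matches and then, for each one, build the string you want by concatenating the captures from groups 1 and 2.

Here's a start: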

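import re

# subject is assumed to hold the CSS text
reobj = re.compile(r"(\.icon-[^\s{]+)\s*{\s*([^;]*;)")
for match in reobj.finditer(subject):
    # Group 1 is the selector, group 2 is the first declaration
    print("{key: '%s', value: '%s'}" % (match.group(1), match.group(2)))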

How do I write a regex to extract a specific key format and its value from a CSS file?
