Question

This is my first question here!
Anyway:

I'm new to regular expressions. To learn them better and to build something I can actually use, I'm trying to write a regex that finds all the CSS tags in a CSS file.

So far I'm using:

[#.]([a-zA-Z0-9_\-])*

which works fine and finds #TB_window as well as #TB_window img#TB_Image and .TB_Image#TB_window.

The problem is that it also finds hex color codes in the CSS file, i.e. #FFF or #eaeaea. It also finds .png or .jpg or 0.75.
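
To make the false positives concrete, here is a minimal Python sketch. The sample CSS string is made up for illustration, and the capture group from the pattern above is written as a non-capturing group so that `findall` returns whole matches.

```python
import re

# The pattern from the question; the group is non-capturing here so that
# findall returns the full match instead of only the last repeated character.
PATTERN = re.compile(r'[#.](?:[a-zA-Z0-9_\-])*')

# Hypothetical sample CSS, invented for this illustration.
css = "#TB_window img#TB_Image { color: #FFF; background: url(a.png); opacity: 0.75; }"

matches = PATTERN.findall(css)
print(matches)
# ['#TB_window', '#TB_Image', '#FFF', '.png', '.75']
```

The first two hits are the selectors I want; the last three come from inside the braces.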

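It's logical that it finds them, but isn't there a smart way around that?
Like excluding anything between the braces {..}?
(I'm pretty sure that's possible, but I don't have much regex experience yet.)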

Thanks in advance!

Cheers!
Mike

Answer 1

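CSS is a very simple, regular language, which means it can be completely parsed by regex. All there is to it are groups of selectors, each followed by a group of options separated by semicolons, where each option is a property and a value separated by a colon.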

Note that all the regexes in this post should have the verbose and dotall flags set (/s and /x in some languages, re.DOTALL and re.VERBOSE in Python).

To get the (selectors, rules) pairs:

\s*        # Match any initial space
([^{}]+?)  # Ungreedily match a string of characters that are not curly braces.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  (.*?)    # Ungreedily match anything any number of times.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
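
As a rough sketch of how that pattern might be used from Python, the snippet below compiles it with both flags and runs it over a tiny stylesheet. The name `RE_SIMPLE_BLOCK` and the sample CSS are assumptions made up for this illustration.

```python
import re

# The simple block pattern from above, compiled with the verbose and dotall flags.
RE_SIMPLE_BLOCK = re.compile(r"""
    \s*        # Match any initial space
    ([^{}]+?)  # Ungreedily match a string of characters that are not curly braces.
    \s*        # Arbitrary spacing again.
    \{         # Opening brace.
      \s*      # Arbitrary spacing again.
      (.*?)    # Ungreedily match anything any number of times.
      \s*      # Arbitrary spacing again.
    \}         # Closing brace.
""", re.VERBOSE | re.DOTALL)

# Hypothetical sample stylesheet, invented for this illustration.
css = """
#TB_window { color: #FFF; }
.TB_Image, p { opacity: 0.75 }
"""

# Each match yields one (selectors, rules) tuple.
pairs = RE_SIMPLE_BLOCK.findall(css)
print(pairs)
# [('#TB_window', 'color: #FFF;'), ('.TB_Image, p', 'opacity: 0.75')]
```

Note that the hex code and the 0.75 now sit in the second element of each pair, away from the selectors.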

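This doesn't work in the rare case where quoted curly braces appear in an attribute selector (e.g. img[src~='{abc}']) or in a rule (e.g. background: url('images/ab{c}.jpg')). This can be fixed by complicating the regex: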

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^{}\"\']           # Any character other than braces or quotes.
  |                   # OR
  \"                  # An opening double quote.
    (?:[^\"\\]|\\.)*  # Either a neither-quote-not-backslash, or an escaped character.
  \"                  # And a closing double quote.
  |                   # OR
  \'(?:[^\'\\]|\\.)*\'  # Same as above, but for single quotes.
)+?)       # Ungreedily match all that once or more.
\s*        # Arbitrary spacing again.
\{         # Opening brace.
  \s*      # Arbitrary spacing again.
  ((?:[^{}\"\']|\"(?:[^\"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')*?)
           # The above line is the same as the one in the selector capture group.
  \s*      # Arbitrary spacing again.
\}         # Closing brace.
# This will even correctly identify escaped quotes.

Whew, that's a handful. But if you approach it in a modular fashion, you'll find it's not as complicated as it looks at first glance.

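Now, to split the selectors and the rules, we have to match strings of characters that are either non-delimiters (where the delimiter is a comma for selectors and a semicolon for rules) or quoted strings with anything inside. We'll use the same pattern we used above.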

For selectors:

\s*        # Match any initial space
((?:       # Start the selectors capture group.
  [^,\"\']             # Any character other than commas or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a neither-quote-not-backslash, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:,|$)      # Followed by a comma or the end of a string.

For rules:

\s*        # Match any initial space
((?:       # Start the rule capture group.
  [^;\"\']             # Any character other than semicolons or quotes.
  |                    # OR
  \"                   # An opening double quote.
    (?:[^\"\\]|\\.)*   # Either a neither-quote-not-backslash, or an escaped character.
  \"                   # And a closing double quote.
  |                    # OR
  \'(?:[^\'\\]|\\.)*\' # Same as above, but for single quotes.
)+?)       # Ungreedily match all that.
\s*        # Arbitrary spacing.
(?:;|$)      # Followed by a semicolon or the end of a string.

Finally, for each rule, we can split (once!) on the colon to get a property-value pair.
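
To see why the split should happen only once, consider a value that itself contains a colon. This is a small sketch; the rule strings are made up for illustration.

```python
import re

# Hypothetical rule strings, invented for this illustration.
rules = ["font-size: normal", "background: url(http://example.com/a.png)"]

# maxsplit=1 stops after the first colon, so the colon in the URL survives.
pairs = [re.split(r'\s*:\s*', rule, maxsplit=1) for rule in rules]
print(pairs)
# [['font-size', 'normal'], ['background', 'url(http://example.com/a.png)']]
```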

Putting all of this together into a Python program (the regexes are the same as above, but non-verbose to save space):

import re

CSS_FILENAME = 'C:/Users/Max/frame.css'

RE_BLOCK = re.compile(r'\s*((?:[^{}"\'\\]|\"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')+?)\s*\{\s*((?:[^{}"\'\\]|"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\')*?)\s*\}', re.DOTALL)
RE_SELECTOR = re.compile(r'\s*((?:[^,"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:,|$)', re.DOTALL)
RE_RULE = re.compile(r'\s*((?:[^;"\'\\]|\"(?:[^"\\]|\\.)*\"|\'(?:[^\'\\]|\\.)*\')+?)\s*(?:;|$)', re.DOTALL)

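# Read the whole stylesheet into memory.
with open(CSS_FILENAME) as f:
    css = f.read()

# For each block: the list of selectors, plus a [property, value] pair per rule.
print([(RE_SELECTOR.findall(i),
        [re.split(r'\s*:\s*', k, maxsplit=1)
         for k in RE_RULE.findall(j)])
       for i, j in RE_BLOCK.findall(css)])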

For this example CSS:

body, p#abc, #cde, a img .fgh, * {
  font-size: normal; background-color: white !important;

  -webkit-box-shadow: none
}

#test[src~='{a\'bc}'], .tester {
  -webkit-transition: opacity 0.35s linear;
  background: white !important url("abc\"cd'{e}.jpg");
  border-radius: 20px;
  opacity: 0;
  -webkit-box-shadow: rgba(0, 0, 0, 0.6) 0px 0px 18px;
}

span {display: block;} .nothing{}

...we get (spaced for clarity):

[(['body',
   'p#abc',
   '#cde',
   'a img .fgh',
   '*'],
  [['font-size', 'normal'],
   ['background-color', 'white !important'],
   ['-webkit-box-shadow', 'none']]),
 (["#test[src~='{a\\'bc}']",
   '.tester'],
  [['-webkit-transition', 'opacity 0.35s linear'],
   ['background', 'white !important url("abc\\"cd\'{e}.jpg")'],
   ['border-radius', '20px'],
   ['opacity', '0'],
   ['-webkit-box-shadow', 'rgba(0, 0, 0, 0.6) 0px 0px 18px']]),
 (['span'],
  [['display', 'block']]),
 (['.nothing'],
  [])]

A simple exercise for the reader: write a regex to remove CSS comments (/* ... */).
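
One possible take on that exercise, as a minimal sketch rather than a complete solution: `strip_comments` is a hypothetical helper name, and this version would also remove comment-like text that appears inside a quoted string.

```python
import re

# Lazily match from "/*" to the nearest "*/"; DOTALL lets "." span newlines.
RE_COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)

def strip_comments(css):
    """Return css with /* ... */ comments removed (hypothetical helper)."""
    return RE_COMMENT.sub('', css)

print(strip_comments("/* header\n   comment */p{color: red;}"))
# p{color: red;}
```

Running this before the block regex keeps braces inside comments from confusing it.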

Answer 2

How about this:

([#.]\S+\s*,?)+(?=\{)

Answer 3

First, I don't see how the RE you posted would find .TB_Image#TB_window. You could do something like this:

/^[#\.]([a-zA-Z0-9_\-]*)\s*{?\s*$/

This finds a # or . at the start of a line, followed by the tag, optionally followed by a {, and then a newline.

Note that this won't work for lines such as .TB_Image { something: 0; } (everything on one line), nor for div.mydivclass, where the . is not at the start of the line.
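
A quick Python sketch of that behaviour, assuming the pattern is applied line by line via re.MULTILINE. The sample CSS is made up for illustration, and the { is escaped here for clarity.

```python
import re

# The line-anchored pattern from above; MULTILINE makes ^ and $ work per line.
RE_LINE = re.compile(r'^[#.]([a-zA-Z0-9_\-]*)\s*\{?\s*$', re.MULTILINE)

# Hypothetical sample CSS, invented for this illustration.
css = (
    "#TB_window {\n"
    "  color: #FFF;\n"
    "}\n"
    ".TB_Image { opacity: 0.75; }\n"
    "div.mydivclass {\n"
    "}\n"
)

# findall returns the captured tag name for each matching line.
found = RE_LINE.findall(css)
print(found)
# ['TB_window']
```

Only the first selector is found; the one-line rule and div.mydivclass are skipped, and the indented #FFF is ignored because it is not at the start of a line.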

Edit: I don't think nested curly braces are allowed in CSS, so if you read in all the data and strip the newlines, you could do something like this:

/([a-zA-Z0-9_\-]*([#\.][a-zA-Z0-9_\-]+)+\s*,?\s*)+{.*}/

There's a way to tell the regex to ignore newlines, but I never seem to get it right.
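
In Python, for instance, that is the re.DOTALL flag, which makes . match newlines as well. A minimal sketch with a made-up sample string:

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "a {\n  color: red;\n}"

# Without DOTALL, "." stops at a newline, so "{" and "}" are never paired up.
without_flag = re.search(r'\{.*\}', text)

# With DOTALL, "." also matches newlines, so the whole block is found.
with_flag = re.search(r'\{.*\}', text, re.DOTALL)

print(without_flag)                # None
print(with_flag is not None)       # True
```

With that flag there is no need to strip the newlines first.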

Answer 4

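Solving this with a regex is actually not an easy task, because there are many possibilities. Consider:

Descendant selectors such as #someid ul img: all of these are valid tags, and they are separated by spaces
Tags that don't start with . or # (i.e. HTML tag names): you would have to supply a list of them in order to match them, since nothing else distinguishes them from properties
Comments
More that I can't think of right now

I think you should look into a CSS parsing library for your language of choice.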

Trying to remove hex codes from regular expression results
