Question

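I am trying to use a regular expression with findall(). The problem is that the text I am matching contains an unknown amount of whitespace (spaces, tabs, line feeds, carriage returns).

In the example below, I want to use findall() to get the text inside <D> </D> whenever </A> follows </D>. My problem is that there are whitespace characters after </D>.

In the example below, I need to retrieve Second Text. The regex I am using only works when there is no whitespace between </D> and </A>. This is what I tried: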

regex = '<D>(.+?)</D></A>'

<A> 
   <B> Text </B> 
   <D> Second Text</D>
</A>

Answer 1

If you need to match the whitespace between </D> and </A>:

regex = r'<D>(.+?)</D>\s*</A>'

Note the use of the r'' raw string literal for the regular expression in Python. It avoids the double escaping that an ordinary string would need:

regex = '<D>(.+?)</D>\\s*</A>'

To make . match newlines as well, pass the re.DOTALL flag when matching.
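As a rough sketch of the points above, the snippet below runs the \s* pattern against the question's sample and shows what re.DOTALL changes. The names sample, multiline, raw_pattern, and escaped_pattern are illustrative, and multiline is an invented input rather than something from the question.

```python
import re

# The sample from the question (trailing spaces on the lines omitted).
sample = """<A>
   <B> Text </B>
   <D> Second Text</D>
</A>"""

# Raw string: the backslash in \s reaches the regex engine unchanged.
raw_pattern = r'<D>(.+?)</D>\s*</A>'
# The same pattern as an ordinary string needs a doubled backslash.
escaped_pattern = '<D>(.+?)</D>\\s*</A>'
same_pattern = raw_pattern == escaped_pattern

# \s* absorbs the newline between </D> and </A>.
matches = re.findall(raw_pattern, sample)
cleaned = [m.strip() for m in matches]

# Hypothetical input whose <D> body spans two lines.
multiline = "<A>\n<D>Second\nText</D> \t\r\n</A>"
without_dotall = re.findall(raw_pattern, multiline)
with_dotall = re.findall(raw_pattern, multiline, re.DOTALL)

print(same_pattern, cleaned, without_dotall, with_dotall)
```

The captured group keeps the leading space from the source, so strip() is applied afterwards. Without re.DOTALL the dot stops at the newline inside the <D> body, so the second input only matches when the flag is set.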

Answer 2

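\. \* \\ are escaped special characters.

\t \n \r are tab, line feed, and carriage return.

\u00A9 is the Unicode escape for ©.

For more details, and to test your regular expressions, try http://regexr.com/.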

For what it's worth, in Python you can also clean up the text with my_string.strip('\t'), or replace the tabs with spaces using my_string.replace('\t', ' ').
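A small sketch of what those two string methods do, assuming an invented input named raw. Note that strip('\t') removes only tabs, and only from the two ends of the string, while strip() with no argument removes any leading or trailing whitespace.

```python
# Hypothetical input: tab and space in front, tab and newline at the end.
raw = "\t Second Text\t\n"

# Only leading/trailing tabs go; the trailing "\t\n" stays because the
# last character is a newline, not a tab.
only_tabs = raw.strip('\t')

# With no argument, all leading and trailing whitespace is removed.
all_whitespace = raw.strip()

# replace() works on every occurrence, including those in the middle.
spaced = "a\tb\tc".replace('\t', ' ')

print(repr(only_tabs), repr(all_whitespace), repr(spaced))
```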

Hope this helps.

Answer 3

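Pyparsing is not always recommended when HTML parsing libraries like BeautifulSoup do such a good job of building a document object that represents the HTML page. But sometimes you do not want the whole document; you only want to pick out a few fragments.

When scraping web pages, a regex is a very fragile extractor: whitespace can crop up in surprising places, tags sometimes carry attributes you did not expect, tag names may be written in upper or lower case, and so on. Pyparsing's helper method makeHTMLTags(tagname) does more than wrap <> around the input string. It handles all the whitespace, letter case, and attribute variations, and still leaves you with a very readable program when you are done. The downside: pyparsing is not the fastest performer.

See the different examples in the input test, and the matches found: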

test = """\
<A> 
   <B> Text </B> 
   <D> Second Text</D>
</A>
<A> 
   <B> Text </B> 
   <d extra_attribute='something'> another Text</d>
</A>
<A> 
   <B> Text </B> 
   <D> yet another Text</D>
   \t\t
</A>
"""

from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, lineno, col

# makeHTMLTags will return patterns for both the opening and closing tags
d,d_end = makeHTMLTags('d')
a,a_end = makeHTMLTags('a')

# define the pattern you want to match
pattern = d + SkipTo(d_end, failOn=anyOpenTag)('body') + d_end + a_end

# use scanString to scan the input HTML, and get match,start,end triples
for match, loc, endloc in pattern.scanString(test):
    print('"%s" at line %d, col %d' % (match.body, lineno(loc, test), col(loc, test)))

Prints:

"Second Text" at line 3, col 4
"another Text" at line 7, col 4
"yet another Text" at line 11, col 4

Answer 4

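This looks like part of an XML document, so it is better not to use regex here; use lxml, bs4, or a similar parser instead. That said, I tried a hybrid approach: first select the <A> element, then select the text inside <D> within that <A>.

The code below prints [' Second Text']: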

import re
# Take a deliberately rough string, txt, that does not even follow the rules of XML

txt = """<A> 


   <B> Text </B> 


   <D> Second Text</D>


</A> line 3\n
<S> 
   <B> Text </B> 
   <D> Second Text</D>
</A>
<E> 
   <B> Text </B> 
   <D> Second Text</D>
</A>"""

A = re.findall(r'<A>[\w\W]*?</A>',txt)
print(re.findall(r'(?<=<D>)[\w\W]*?(?=</D>)', ''.join(A)))

See the demo HERE for the first expression, and the demo HERE for the second regex.
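The advice to use a parser rather than regex could be sketched as follows. This is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree in place of lxml or bs4 so that nothing needs installing, it wraps the input in a <root> element, and the helper name last_d_texts is made up for this example. ElementTree needs well-formed, case-sensitive XML, so it would reject the deliberately rough txt string above.

```python
import xml.etree.ElementTree as ET

def last_d_texts(xml_text):
    """Return the stripped text of each <D> that is the last child of an <A>."""
    root = ET.fromstring(xml_text)
    results = []
    for a in root.iter('A'):
        children = list(a)
        # Whitespace between </D> and </A> is not an element, so the
        # parser ignores it when we look at the last child.
        if children and children[-1].tag == 'D':
            results.append((children[-1].text or '').strip())
    return results

# Hypothetical well-formed input: the second <A> does not end with <D>.
doc = """<root>
<A>
   <B> Text </B>
   <D> Second Text</D>

</A>
<A>
   <D> Not last </D>
   <B> Text </B>
</A>
</root>"""

found = last_d_texts(doc)
print(found)
```

Because the parser works on elements rather than characters, no whitespace handling is needed in the matching logic at all.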

Exclude whitespace from search pattern