Question

I need help with regular expressions in Python.

I have a large HTML file (about 400 lines) that follows this pattern:

text here(div,span,img tags)

<!-- 3GP||Link|| --> 

text here(div,span,img tags)

So now I'm looking for a regular expression that can extract this for me:

Link

The given pattern occurs only once in the HTML file.

Answer 1

>>> d = """
... Some text here(div,span,img tags)
...
... <!-- 3GP||**Some link**|| -->
...
... Some text here(div,span,img tags)
... """
>>> import re
>>> re.findall(r'\<!-- 3GP\|\|([^|]+)\|\| --\>',d)
['**Some link**']

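r'' is a raw string literal; it stops Python from interpreting standard string escapes.
\<!-- 3GP\|\| matches the literal text <!-- 3GP|| (the backslashes before < and > are harmless but not required).
([^|]+) matches everything up to the next | and captures it in a group for convenience.
\|\| --\> matches the literal text || -->
re.findall returns all non-overlapping matches of the pattern in the string; if the pattern contains a single group, it returns the text captured by that group rather than the whole match.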
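
To illustrate how the return value of re.findall depends on the groups in the pattern, here is a small sketch. The sample string and the variable names are made up for this example and do not come from the question.

```python
import re

# Hypothetical input with two matching comments, so the list results are visible.
text = "<!-- 3GP||first|| --> and <!-- 3GP||second|| -->"

# No group: findall returns each whole match.
whole = re.findall(r"<!-- 3GP\|\|[^|]+\|\| -->", text)

# One group: findall returns only the text captured by that group.
links = re.findall(r"<!-- 3GP\|\|([^|]+)\|\| -->", text)

# Two groups: findall returns one tuple per match.
pairs = re.findall(r"<!-- (\w+)\|\|([^|]+)\|\| -->", text)

print(whole)
print(links)
print(pairs)
```

Because the pattern in the question is unique in the file, the one-group form would give a list with a single element, and indexing it with [0] would fetch the link.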

Answer 2

import re
re.match(r"<!-- 3GP\|\|(.+?)\|\| -->", "<!-- 3GP||Link|| -->").group(1)

This yields "Link".
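
One caveat worth sketching: re.match only matches at the start of the string, so it works on the bare comment above but would return None on a whole page where other markup comes first. Assuming the full file contents are held in a string, re.search is probably the safer call. The helper name extract_link and the sample page below are hypothetical.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r"<!-- 3GP\|\|(.+?)\|\| -->")


def extract_link(html):
    """Return the text between the || pairs, or None if the comment is absent."""
    found = PATTERN.search(html)  # search scans the whole string
    return found.group(1) if found else None


# Hypothetical page: the comment is not at the very start of the string.
page = "<div>text</div>\n<!-- 3GP||Link|| -->\n<span>text</span>"

print(PATTERN.match(page))   # match is anchored at position 0, so this is None
print(extract_link(page))
```

Checking for None before calling group(1) avoids an AttributeError when the comment is missing.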

Answer 3

If you need to parse other content as well, you can also combine a regular expression with BeautifulSoup:

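import re
from bs4 import BeautifulSoup, Comment  # BeautifulSoup 4 (the bs4 package)

# html: your HTML document as a string
soup = BeautifulSoup(html, "html.parser")
# Raw string so the backslashes reach the regex engine unchanged
link_regex = re.compile(r'\s+3GP\|\|(.*)\|\|\s+')
# Find the first comment node whose text matches the pattern
comment = soup.find(string=lambda text: isinstance(text, Comment)
                    and link_regex.match(text))
link = link_regex.match(comment).group(1)
print(link)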

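Note that in this case the regular expression only needs to match the contents of the comment, because BeautifulSoup has already taken care of extracting the text from the comment.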
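
The same idea (let a parser hand over the comment text, then match only its contents) can be sketched with the standard library alone. This swaps BeautifulSoup for html.parser.HTMLParser, which is not what the answer uses. The class name CommentLinkParser, the helper find_links, and the sample string are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Matches only the comment body, e.g. " 3GP||Link|| ", not the <!-- --> markers.
LINK_RE = re.compile(r"\s*3GP\|\|(.*?)\|\|\s*")


class CommentLinkParser(HTMLParser):
    """Collects the link from every comment shaped like 3GP||...||."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_comment(self, data):
        # data is the text between <!-- and -->, already extracted by the parser.
        found = LINK_RE.fullmatch(data)
        if found:
            self.links.append(found.group(1))


def find_links(html):
    parser = CommentLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample input.
sample = "<div>text</div><!-- 3GP||Link|| --><span>text</span>"
print(find_links(sample))
```

This avoids a third-party dependency, at the cost of writing a small parser subclass instead of a one-line find call.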

Extracting text between patterns using regex
