Question

I have a regular expression that removes everything in a file before <div id="content" and everything from <div id="footer" onward:

([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*)

I am using the regular expression in Python with the re module. This is the Python code I am using:

file = open(file_dir)
content = file.read()
result = re.search('([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*))', content)

I have also tried re.match. I cannot get it to return what I want. Right now I can only get it to return everything before div#content.

Answer 1

Although it is not advisable, you can extract the content instead of merely matching it:

import re

rx = re.compile(r'''
        .*?
        (
            <div\ id="content"
            .+?
        )
        <div\ id="footer
        ''', re.VERBOSE | re.DOTALL)

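# your_string_here is a placeholder for the HTML text read from the file.
# search() scans from index 0; the original findall(..., 1) passed pos=1,
# which skips the first character of the input.
content = rx.search(your_string_here).group(1)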
print(content)

This yields:

<div id="content" class="other">
i have this other stuff 
<div>More stuff</div>

See a demo on regex101.com. Better yet, use a parser such as BeautifulSoup instead.

Answer 2

If you will allow me a comment: HTML + regex = madness. :)

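HTML is usually irregular, and a few stray characters will break the cleverest regex. Also, many web pages that appear to be HTML are not in fact readily available as HTML. At the same time, there are several lovely, actively developed tools for processing web sites, among them BeautifulSoup, selenium and scrapy.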

>>> from io import StringIO
>>> import bs4
>>> HTML = StringIO('''\
... <body>
...     <div id="container">
...         <div id="content">
...             <span class="something_1">some words</span>
...             <a href="https://link">big one</a>
...         </div>
...     <div>
...     <div id="footer">
... </body>''')
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.find('div', attrs={'id': 'container'})
<div id="container">
<div id="content">
<span class="something_1">some words</span>
<a href="https://link">big one</a>
</div>
<div>
<div id="footer">
</div></div></div>
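
If the goal is only the div#content element itself, the same parser can select it directly. The following is a minimal sketch, not part of the original answer: the sample markup (including the header div) is made up for illustration, and it assumes the standard-library 'html.parser' backend instead of 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the header div is invented for illustration.
html = '''<body>
<div id="container">
<div id="header">skip me</div>
<div id="content">
<span class="something_1">some words</span>
<a href="https://link">big one</a>
</div>
<div id="footer">skip me too</div>
</div>
</body>'''

# 'html.parser' ships with Python; 'lxml' would also work if installed.
soup = BeautifulSoup(html, 'html.parser')

# Select the element whose id attribute is "content".
content_div = soup.find('div', id='content')

# Serialize only that element; everything before and after it is dropped.
content_html = str(content_div)
print(content_html)
```

Unlike the regex approaches, this returns exactly the element, ending at its own closing tag, rather than everything up to the start of the footer.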

Answer 3

This regex should work: https://regex101.com/r/L1zzOc/1

\<div id=\"content\"[.\s\S]*?(?=\<div id=\"footer\")

Your original code appears to contain a typo: the closing " after <div id="footer is missing.
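
For what it is worth, here is a rough sketch of how this could look with the re module. It is an illustration under assumptions, not a tested drop-in: the names CONTENT_RX, STRIP_RX, extract_content and sample are hypothetical, and the sample markup is invented. It shows two routes. The first uses re.search with the pattern above to pull out the wanted part. The second keeps the idea from the question (match the parts to delete) but applies it with re.sub, since re.search only ever returns the first match, which is why only the text before div#content came back.

```python
import re

# The pattern from this answer, without the unnecessary backslashes:
# match from <div id="content" lazily up to (not including) <div id="footer".
CONTENT_RX = re.compile(r'<div id="content"[\s\S]*?(?=<div id="footer")')

# The question's pattern with the missing quote restored. It matches the
# parts to throw away, so it is meant to be used with sub(), not search().
STRIP_RX = re.compile(
    r'[\s\S]*(?=<div id="content")|(?=<div id="footer")[\s\S]*'
)


def extract_content(html):
    """Return the text from div#content up to the footer, or None."""
    match = CONTENT_RX.search(html)
    return match.group(0) if match else None


# Hypothetical sample input, made up for this sketch.
sample = (
    '<html><body><div id="header">top</div>'
    '<div id="content" class="other">keep me<div>More stuff</div></div>'
    '<div id="footer">bottom</div></body></html>'
)

# Route 1: extract the wanted part directly.
extracted = extract_content(sample)

# Route 2: delete everything before the content and from the footer onward.
stripped = STRIP_RX.sub('', sample)

print(extracted)
print(stripped)
```

Both routes depend on the exact spelling of the two opening tags, so any change in attribute order or quoting will break them; that fragility is the reason the other answers recommend a parser.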

Adapting a regular expression to the Python re module
