Question

I'm trying to use a Python regular expression to remove some React markup from an HTML file. Part of the HTML file looks like this.

<span data-reactid="57">Price/Book</span><!-- react-text: 58 --> <!-- /react-text --><!-- react-text: 59 -->(mrq)<!-- /react-text --><sup aria-label="KS_HELP_SUP_undefined" data-reactid="60"></sup></td><td class="Fz(s) Fw(500) Ta(end)" data-reactid="61">8.36</td>

My Python regex code is shown below.

cleandUpCode = re.sub(r'<!-- react-text: \d{1,2,3} -->', '', sourceCode)

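The sourceCode variable holds the original HTML source, which contains all the React junk. The code runs without errors, but when I write the output to a file and inspect it, all the React junk tags are still there.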

Can anyone help?

Thanks a lot in advance.

-Frank

Answer 1

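In your code, change \d{1,2,3} to \d{1,3}. The quantifier {1,3} repeats the preceding item one to three times. {1,2,3} is not a valid quantifier, so Python's re module treats it as literal text, which is why nothing matched.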
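To illustrate the difference, here is a minimal sketch that runs both patterns against a shortened copy of the markup from the question. The variable names (html, broken, fixed) are made up for this example.

```python
import re

# Shortened copy of the markup from the question.
html = ('<span data-reactid="57">Price/Book</span><!-- react-text: 58 --> '
        '<!-- /react-text --><!-- react-text: 59 -->(mrq)<!-- /react-text -->')

# {1,2,3} is not a quantifier, so re looks for a digit followed by the
# literal characters "{1,2,3}" and finds nothing to replace.
broken = re.sub(r'<!-- react-text: \d{1,2,3} -->', '', html)

# {1,3} means "one to three digits", so the opening comments are removed.
fixed = re.sub(r'<!-- react-text: \d{1,3} -->', '', html)

print(broken == html)  # the broken pattern leaves the input untouched
print(fixed)
```

Note that the closing <!-- /react-text --> comments are still present after this fix, because the pattern only targets the opening comments.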

Regex quantifiers: http://www.rexegg.com/regex-quickstart.html#quantifiers

See: Python Regex Demo

Update: If you want to remove all react-text comments except for specific content, use the following instead:

Answer 2

You only need to specify the maximum number of digits that can appear in the react tag. Also, to remove both kinds of react-text comments (with and without a number), you can add an alternation (|) so the pattern matches either one:

cleandUpCode = re.sub(r'<!-- react-text: \d{1,3} -->|<!-- /react-text -->', '', sourceCode)
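As a rough check, the alternation pattern can be applied to the full sample from the question. This is a sketch only; source_code, pattern and cleaned are names chosen for the example.

```python
import re

# Full sample markup from the question.
source_code = (
    '<span data-reactid="57">Price/Book</span><!-- react-text: 58 --> '
    '<!-- /react-text --><!-- react-text: 59 -->(mrq)<!-- /react-text -->'
    '<sup aria-label="KS_HELP_SUP_undefined" data-reactid="60"></sup></td>'
    '<td class="Fz(s) Fw(500) Ta(end)" data-reactid="61">8.36</td>'
)

# Match either an opening comment (with a 1-3 digit id) or a closing comment.
pattern = re.compile(r'<!-- react-text: \d{1,3} -->|<!-- /react-text -->')

# Replace every match with the empty string; the rest of the HTML is untouched.
cleaned = pattern.sub('', source_code)
print(cleaned)
```

If the ids can grow beyond three digits, \d+ is a looser alternative to \d{1,3}.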

Answer 3

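If you're trying to navigate an HTML document with Python, a library called BeautifulSoup 4 is easier and more practical.

You can download it from https://pypi.python.org/pypi/beautifulsoup4, or you can use "pip" by running pip install beautifulsoup4 on the command line. Then include it in your project with from bs4 import BeautifulSoup.

Now you can extract the text from the document, if that is what you want to do.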

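from bs4 import BeautifulSoup

# Sample markup from the question.
html = '<span data-reactid="57">Price/Book</span><!-- react-text: 58 --> <!-- /react-text --><!-- react-text: 59 -->(mrq)<!-- /react-text --><sup aria-label="KS_HELP_SUP_undefined" data-reactid="60"></sup></td><td class="Fz(s) Fw(500) Ta(end)" data-reactid="61">8.36</td>'
# 'lxml' requires the lxml package; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(html, 'lxml')
# get_text() returns the document's text content without the tags.
text = soup.get_text()
# Write the extracted text to a file.
with open("text.txt", "w") as out:
    out.write(text)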
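If installing a third-party package is not an option, roughly the same text extraction can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the approach from the answer above, and TextExtractor and extract_text are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are ignored."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Comments go to handle_comment,
        # which does nothing by default, so react-text markers are skipped.
        self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)


sample = (
    '<span data-reactid="57">Price/Book</span><!-- react-text: 58 --> '
    '<!-- /react-text --><!-- react-text: 59 -->(mrq)<!-- /react-text -->'
    '<sup aria-label="KS_HELP_SUP_undefined" data-reactid="60"></sup></td>'
    '<td class="Fz(s) Fw(500) Ta(end)" data-reactid="61">8.36</td>'
)
print(extract_text(sample))
```

Like get_text(), this discards the tags as well as the comments, so it only fits cases where the plain text is all that is needed.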

Removing React markup from HTML with a Python regular expression

3 answers: