
Question

My HTML text looks like the snippet below. I want to extract the plain text from it in Python using a regex (without using HTML parsers).

&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;

How can I find the right regular expression to get the plain text?

Answer 1

You can do this in JavaScript with a simple selector method and then read the .innerHTML property.

// select the elements by the class you want to pull the HTML from
let div = document.getElementsByClassName('text-div');
// take the first element of the HTMLCollection returned by the selector method and read its inner HTML
let text = div[0].innerHTML;

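This selects the element whose HTML you want to retrieve and then extracts its inner HTML, assuming you only want the content between the HTML tags and not the tags themselves.

A regex is not necessary. You would have to implement the regex in JS or on some backend anyway, and as long as you can insert a JS script into your project, you can get the inner HTML.

If you are scraping data, then whatever language you use, your library very likely has selector methods and methods for retrieving the HTML text easily, without a regex.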
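Since the question is about Python, here is a minimal sketch of that library-based route. It uses the standard library's html.parser rather than a third-party scraping library, and the names TextCollector and extract_text are hypothetical, chosen only for this example.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text that sits outside of tags.
        self.chunks.append(data)


def extract_text(markup):
    # Decode entities such as &lt; first so the parser sees real tags.
    parser = TextCollector()
    parser.feed(html.unescape(markup))
    parser.close()
    return "".join(parser.chunks).strip()


sample = "&lt;p&gt;&lt;span&gt;\nHello, world.\n&lt;/span&gt;&lt;/p&gt;"
print(extract_text(sample))  # Hello, world.
```

The entity decoding step is only needed because the input in the question is escaped. For markup that already contains real tags, the text can be fed to the parser directly.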

Answer 2

You are better off using a parser here:

import html, xml.etree.ElementTree as ET

# decode
string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

# construct the dom
root = ET.fromstring(html.unescape(string))

# search it: iterate over the direct children of the root <p> (here, the <span>)
for child in root.findall("*"):
    print(child.text)

This yields

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

Obviously, you may want to change the XPath expression, so have a look at the possibilities.
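The loop above only reads the .text of the root's direct children. If the text is spread over several nesting levels, one possible alternative is itertext(), sketched below. The snippet string is made up for illustration, and ET.fromstring still assumes the decoded markup is well-formed XML.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical input with text at several nesting levels.
snippet = "&lt;p&gt;Intro &lt;span&gt;nested text&lt;/span&gt; tail.&lt;/p&gt;"
root = ET.fromstring(html.unescape(snippet))

# itertext() walks the element and all of its descendants in document
# order, yielding every .text and .tail string it encounters.
plain = "".join(root.itertext()).strip()
print(plain)  # Intro nested text tail.
```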


Addendum:

A regex can be used here, but this approach is very error-prone and not advisable:

import re

string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

The idea is to look for an uppercase letter and then match word characters, whitespace and commas up to a period. See a demo on regex101.com.
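A different regex idea, shown here only as a sketch, is to decode the entities and then delete anything that looks like a tag. The names TAG_RE and strip_tags and the sample string are assumptions made for this example. The pattern is just as fragile as the one above: it breaks on a ">" inside an attribute value, on comments, and on script or style content.

```python
import html
import re

# Naive tag pattern: "<", then anything except ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    # Turn &lt;p&gt; back into <p> before trying to remove tags.
    decoded = html.unescape(markup)
    return TAG_RE.sub("", decoded).strip()


sample = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span&gt;\n"
    "Prices rose 5% in 2020; see the report.\n"
    "&lt;/span&gt;&lt;/p&gt;"
)
print(strip_tags(sample))  # Prices rose 5% in 2020; see the report.
```

The capital-letter pattern from the addendum finds nothing in this sample, because the sentence contains "%" and ";", which are outside its character class. That is one concrete way the content-based approach fails.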

Extracting text from HTML tags using a regular expression

2 answers:
