How do I extract the text content of an HTML page without the markup tags?

Asked: 2019-03-29 12:50:32

Tags: javascript python html css

I want to extract only the text from an HTML page, without any of the markup. How can I do this in Python (preferably) or JavaScript?

For the following code:

<div id = #one>
 OneDivision
 <div id = #two>TwoDivision</div>
 <span>SpanElement</span>
</div>

My output should be: OneDivision TwoDivision SpanElement

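3 answers:

Answer 0: (score: 1)

Very easy! In JavaScript, read the element's textContent property. Note that textContent returns the text with its whitespace intact, so the result keeps the line breaks and indentation from the markup. See the code below: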

console.log(document.getElementById("one").textContent);
<div id = "one">
 OneDivision
 <div id = "two">TwoDivision</div>
 <span>SpanElement</span>
</div>

Answer 1: (score: 0)

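from bs4 import BeautifulSoup

html = '<div id = #one>OneDivision<div id = #two>TwoDivision</div><span>SpanElement</span></div>'
# Parse with the lxml parser (the lxml package must be installed).
soup = BeautifulSoup(html, "lxml")
# Join all text nodes with a single space.
print(soup.get_text(separator=' '))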

Output:

OneDivision TwoDivision SpanElement
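
If installing bs4 and lxml is not an option, roughly the same result can probably be had with the standard library alone. The following is a minimal sketch, assuming Python 3. The names TextCollector and html_to_text are made up for illustration, and the sample markup uses quoted ids instead of the `#one` form from the question.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of a document and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def html_to_text(html, separator=" "):
    # Hypothetical helper: feed the markup to the parser and join the text.
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


html = """<div id="one">
 OneDivision
 <div id="two">TwoDivision</div>
 <span>SpanElement</span>
</div>"""
print(html_to_text(html))
```

Stripping each text run also removes the indentation and line breaks that sit between the tags, which is why the pieces come out separated by single spaces.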

Answer 2: (score: 0)

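from bs4 import BeautifulSoup


def extract_text(html):
    # extract_text is a hypothetical wrapper name; the original snippet was the body of an unnamed function.
    html_doc = BeautifulSoup(html, 'lxml').body

    if html_doc is None:
        return None

    # Remove <script> and <style> elements so their contents do not appear in the text.
    for tag in html_doc.select('script'):
        tag.decompose()
    for tag in html_doc.select('style'):
        tag.decompose()

    # Put each text node on its own line.
    return html_doc.get_text(separator='\n')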
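
The same idea of dropping script and style contents can be sketched without third-party packages. This is only an illustration, assuming Python 3 and the standard html.parser module. VisibleTextCollector and visible_text are hypothetical names, and unlike the snippet above this version does not restrict itself to the body element, so text in the head (a title, for example) would be kept.

```python
from html.parser import HTMLParser


class VisibleTextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        stripped = data.strip()
        if stripped and self.skip_depth == 0:
            self.chunks.append(stripped)


def visible_text(html, separator="\n"):
    # Hypothetical helper: parse the markup and join the kept text runs.
    parser = VisibleTextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


page = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)
print(visible_text(page))
```

Tracking a depth counter rather than a boolean keeps the logic safe if a start tag for a skipped element is ever reported more than once before its end tag.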