I want to extract only the text from an HTML page, without any of the markup. How can I do this in Python (preferably) or JavaScript?
For the following code:
<div id = #one>
OneDivision
<div id = #two>TwoDivision</div>
<span>SpanElement</span>
</div>
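My output should be: OneDivision TwoDivision SpanElement
Answer 0: (score: 1)
Super easy! In JavaScript, use textContent. It returns the text of an element and all of its descendants, including the whitespace between them. See the following code: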
console.log(document.getElementById("one").textContent);
<div id = "one">
OneDivision
<div id = "two">TwoDivision</div>
<span>SpanElement</span>
</div>
Answer 1: (score: 0)
from bs4 import BeautifulSoup
html = '<div id = #one>OneDivision<div id = #two>TwoDivision</div><span>SpanElement</span></div>'
soup = BeautifulSoup(html,"lxml")
print(soup.get_text(separator=' '))
Output:
'OneDivision TwoDivision SpanElement'
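If installing bs4 and lxml is not an option, roughly the same result can be had from the standard library alone. The following is only a sketch, not what the answer above uses: `TextExtractor` and `html_to_text` are names made up for illustration, and it relies on `html.parser.HTMLParser` tolerating the unquoted `id = #one` attributes from the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags and the contents of script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # non-empty text fragments, in document order
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = data.strip()
            if text:
                self.parts.append(text)


def html_to_text(html, separator=" "):
    """Hypothetical helper: return the text of `html` joined by `separator`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.parts)


html = """<div id = #one>
OneDivision
<div id = #two>TwoDivision</div>
<span>SpanElement</span>
</div>"""
print(html_to_text(html))
```

Stripping each fragment before joining is a design choice here: it avoids the stray newlines that come from the indentation in the markup, at the cost of losing whitespace that was meant to be significant.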
Answer 2: (score: 0)
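from bs4 import BeautifulSoup

# extract_text is an illustrative name; the original snippet was a bare function body
def extract_text(html):
    # Parse the document and keep only the <body> subtree
    html_doc = BeautifulSoup(html, 'lxml').body
    if html_doc is None:
        return None
    # Remove <script> and <style> elements so their contents are not returned as text
    for tag in html_doc.select('script, style'):
        tag.decompose()
    # Join the remaining text nodes, one per line
    return html_doc.get_text(separator='\n')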
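With separator='\n', the text you get back from pretty-printed HTML tends to contain blank lines and leftover indentation. A small cleanup step can tidy that up. This is just a sketch: `clean_lines` is a hypothetical helper, and the `raw` string below is a hand-written stand-in for get_text output, not something actually produced by the code above.

```python
def clean_lines(text):
    """Strip every line and drop the ones that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Stand-in for what get_text(separator='\n') might return (assumed, not measured)
raw = "\nOneDivision\n\n  TwoDivision\n\n\nSpanElement\n"
print(clean_lines(raw))
```

If you want everything on a single line instead, as in the question, `" ".join(text.split())` collapses all runs of whitespace into single spaces.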