Question

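I'm not familiar with HTML and I'm trying to extract the body text of an HTML page. I need to strip out all the HTML elements and keep only the text. When I use BeautifulSoup's get_text() method, I get some unexpected results: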

var suffix = device.type === "pc" ? ".pc" : ".mobile";requirejs.config({
paths: {
    "F": "http://y0.ifengimg.com/base/origin/F-amd-1.2.0.min",
    "FM":  "http://y0.ifengimg.com/commonpage/1130/F-amd-mobile-1.1.0.min",
    "debug": "http://y0.ifengimg.com/commonpage/1130/F-amd-mobile-1.1.0.min",

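The text is included, of course, but I don't want the JavaScript functions or other non-text elements. After inspecting the HTML source, it seems these functions or scripts sit between the <script> and </script> tags.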

I'd like to know whether I should use the re module or BeautifulSoup to handle this.

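Update: I removed the scripts with the extract() method, but now I've hit another problem. Something like this still shows up: <img src***="1"/>

It is still in the output of soup.get_text(). I don't know why it isn't treated as a tag and removed. Of course I could delete it by hand, but that doesn't seem elegant for a programmer.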

Answer 1

Hmm... it looks like we can simply extract them (that is, remove them from the BeautifulSoup object that represents your HTML file):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>Hello</p><script>console.log("A test!")</script>', 'html.parser')
>>> soup.get_text()
'Helloconsole.log("A test!")'

>>> soup
<p>Hello</p><script>console.log("A test!")</script>

>>> soup.find('script')
<script>console.log("A test!")</script>

>>> soup.find('script').extract()
<script>console.log("A test!")</script>

>>> soup
<p>Hello</p>

>>> soup.get_text()
'Hello'
>>>

However, if your HTML file contains more than one script tag, use soup.find_all() instead:

for tag in soup.find_all('script'):
    tag.extract()

print(soup.get_text())
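
The loop above can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name visible_text and its drop parameter are made up for illustration, and it also removes <style> tags, which the question did not mention but which tend to leak CSS into the text in the same way. It uses decompose() rather than extract(), since the removed tags are not needed afterwards.

```python
from bs4 import BeautifulSoup


def visible_text(html, drop=("script", "style")):
    """Return the text of `html` with the tags named in `drop` removed.

    `visible_text` and `drop` are hypothetical names used for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and matches any of them
    for tag in soup.find_all(list(drop)):
        # decompose() removes the tag and its contents from the tree
        tag.decompose()
    # Join the remaining strings with a space and strip surrounding whitespace
    return soup.get_text(separator=" ", strip=True)


html = (
    '<p>Hello</p><script>console.log("A test!")</script>'
    "<style>p {color: red}</style><p>World</p>"
)
print(visible_text(html))  # prints: Hello World
```

As for re versus BeautifulSoup: removing the tags through the parser is probably the safer route, because a regular expression can easily be tripped up by attributes or by markup nested inside a script.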
