Question

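I am trying to get the content of a meta tag. The problem is that BS4 cannot parse the tag properly on some sites, where the tag is not closed as it should be. With a tag like the example below, the output of my function includes a lot of clutter, including other tags such as script, link, and so on. I believe the browser automatically closes the meta tag somewhere in the head, and this behavior confuses BS4.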

My code works when the tag is closed properly, and does not work with this:

<meta name="description" content="content">

Here is the code of my BS4 function:

from bs4 import BeautifulSoup

html = BeautifulSoup(open('/path/file.html'), 'html.parser')
desc = html.find(attrs={'name':'description'})

print(desc)

Is there a way to make it work with these unclosed meta tags?

Answer 1

The html5lib or lxml parser handles this correctly:

In [1]: from bs4 import BeautifulSoup
   ...: 
   ...: data = """
   ...: <html>
   ...:     <head>
   ...:         <meta name="description" content="content">
   ...:         <script>
   ...:             var i = 0;
   ...:         </script>
   ...:     </head>
   ...:     <body>
   ...:         <div id="content">content</div>
   ...:     </body>
   ...: </html>"""
   ...: 

In [2]: BeautifulSoup(data, 'html.parser').find(attrs={'name': 'description'})
Out[2]: <meta content="content" name="description">\n<script>\n            var i = 0;\n        </script>\n</meta>

In [3]: BeautifulSoup(data, 'html5lib').find(attrs={'name': 'description'})
Out[3]: <meta content="content" name="description"/>

In [4]: BeautifulSoup(data, 'lxml').find(attrs={'name': 'description'})
Out[4]: <meta content="content" name="description"/>
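
If only the value of the content attribute is needed, a minimal sketch along these lines may help. The helper name get_meta_description and the sample markup are assumptions made for illustration, not part of the original code. It reads the attribute instead of printing the whole tag, so the result should not depend on what the parser nested inside the element. Passing 'lxml' or 'html5lib' as the parser assumes that package is installed.

```python
from bs4 import BeautifulSoup


def get_meta_description(markup, parser="html.parser"):
    """Return the content attribute of <meta name="description">, or None.

    Hypothetical helper for illustration; `parser` is any parser name
    accepted by BeautifulSoup (for example "lxml" or "html5lib").
    """
    soup = BeautifulSoup(markup, parser)
    tag = soup.find("meta", attrs={"name": "description"})
    # Read the attribute rather than printing the tag: the attribute value
    # is unaffected by whatever the parser placed inside the element.
    return tag.get("content") if tag is not None else None


# Sample markup with an unclosed meta tag, modelled on the session above.
sample = """
<html>
    <head>
        <meta name="description" content="content">
        <script>var i = 0;</script>
    </head>
    <body><div id="content">content</div></body>
</html>"""

print(get_meta_description(sample))
```

Returning None when the tag is missing keeps the caller from hitting an AttributeError on pages that have no description at all.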

Answer 2

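I found something new and hope it helps. I think that every time BeautifulSoup finds an element without a proper end tag, it keeps consuming the following elements until it reaches the end tag of the parent. In case the idea is still unclear, here is a small demo: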

    hello.html
<!DOCTYPE html>
    <html lang="en">
    <meta name="description" content="content">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <div>
    <p class="title"><b>The Dormouse's story</b>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    </p></div>
    </body>
    </html>

Run it as before, and the result is below:

<meta content="content" name="description">
<head>
<meta charset="utf-8">
<title>Title</title>
</meta></head>
<body>
...
</div></body>
</meta>

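OK! BeautifulSoup generates the closing meta tag automatically and places it after the </body> tag, but the end tag of the meta's parent, </html>, still does not appear. So my point is that the end tag is placed to mirror its start tag. I was not fully convinced of this, so I ran a test: I removed the end tag of <p class='title'>, so that only one </p> remains inside <div>...</div>. But after running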

c = soup.find_all('p', attrs={'class':'title'})
print(c[0])

the result contains two </p> tags, which matches what I said above.
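
A small sketch of the same experiment, assuming the 'html.parser' backend and a shortened, made-up snippet rather than the full hello.html. It lists which tags ended up inside the unclosed <p>, which is one way to check where the parser decided to close it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the <p> has no end tag of its own,
# so the parser has to decide where it ends.
snippet = '<div><p class="title"><b>The Dormouse\'s story</b></div>'

soup = BeautifulSoup(snippet, "html.parser")
p = soup.find("p", attrs={"class": "title"})

# Names of every tag the parser placed inside the <p>.
nested = [t.name for t in p.find_all(True)]
print(nested)

# The serialized tag shows the end tag that BeautifulSoup generated.
print(p)
```

Here the <p> is closed when its parent <div> closes, so only the <b> ends up inside it.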

Scraping unclosed meta tags with BS4
