Question

I am trying to parse a website and retrieve the text inside its hyperlinks. For example:

<a href="www.example.com">This is an Example</a>

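I need to retrieve "This is an Example", which I can do for pages that have no broken tags. I am unable to retrieve the text in the following case: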

<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>

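In this case, the code fails to retrieve "Google" because the tag for the Google link is broken, and it only gives me "Example". Is there a way to retrieve "Google" as well?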

My code is here:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

f = open("sol.html","r")

soup = BeautifulSoup(f,parse_only=SoupStrainer('a'))
for link in soup.findAll('a',text=True):
    print link.renderContents();

Please note that sol.html contains the HTML code given above.

Thanks - AJ

Answer 1

Remove text=True from the code and it should work fine:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <html>
... <body>
... <a href = "http:\\www.google.com">Google<br>
... <a href = "http:\\www.example.com">Example</a>
... </body>
... </html>
... ''')
>>> [a.get_text().strip() for a in soup.find_all('a')]
[u'Google', u'Example']
>>> [a.get_text().strip() for a in soup.find_all('a', text=True)]
[u'Example']
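
To see why the text=True filter drops the first link, it helps to look at .string: it is None whenever a tag has more than one child, and the unclosed Google link contains both text and a <br>. Below is a minimal Python 3 sketch, assuming bs4 is installed and using the built-in html.parser (the session above does not say which parser it used, and parsers repair the unclosed tag differently). The sample markup is adapted from the question, and the helper name first_text is hypothetical. It takes the first non-empty string inside each tag, so the result does not depend on how the parser nests the two links. string=True is the current spelling of the text=True argument.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question: the first <a> is never closed.
HTML = """
<html>
<body>
<a href="http://www.google.com">Google<br>
<a href="http://www.example.com">Example</a>
</body>
</html>
"""


def first_text(tag):
    """Return the first non-empty string inside tag, or None (hypothetical helper)."""
    return next(tag.stripped_strings, None)


soup = BeautifulSoup(HTML, "html.parser")

# Without a string filter, both <a> tags are found.
all_links = [first_text(a) for a in soup.find_all("a")]

# With string=True, only tags whose .string is not None match.
# The unclosed Google link has several children, so its .string is None.
filtered_links = [first_text(a) for a in soup.find_all("a", string=True)]

print(all_links)
print(filtered_links)
```

With html.parser this should print ['Google', 'Example'] followed by ['Example'], matching the two results in the session above.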

Answer 2

Try this code:

from bs4 import BeautifulSoup

text = '''
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
<a href = "http:\\www.example.com">Example</a>
</body>
</html>
'''

# Parse with the parser that ships with Python.
soup = BeautifulSoup(text, 'html.parser')

for link in soup.find_all('a'):
    # .string is None when a tag has more than one child
    if link.string is not None:
        print(link.string)

This is the output when I run the code:

Example

Just replace text with text = open('sol.html').read(), or whatever you need to put there.
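
Reading the markup from a file can be sketched as follows. This is a Python 3 sketch under assumptions: bs4 is installed, the built-in html.parser is used, and the function name link_texts is made up for illustration. It differs from the loop above in one respect: instead of skipping links whose .string is None, it takes the first non-empty string of each link, so the unclosed Google link is reported too.

```python
from bs4 import BeautifulSoup


def link_texts(path):
    """Return the first non-empty text of every <a> tag in the HTML file at path."""
    # The context manager closes the file even if parsing raises.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    texts = []
    for link in soup.find_all("a"):
        # .string would be None for a link with several children,
        # so take the first non-empty string instead.
        text = next(link.stripped_strings, None)
        if text is not None:
            texts.append(text)
    return texts
```

Calling link_texts("sol.html") on the file from the question should then return both link texts rather than only the second one.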

Retrieve contents from broken <a> tags using Beautiful Soup
