I am learning how to do web scraping with Python and was given the following HTML file:
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
I opened the file and read it into the variable exampleSoup. I then wanted to scrape the author from it and was told to use:
elems = exampleSoup.select('#author')
However, this returned an empty list. I then tried:
elems = exampleSoup.select('span#author')
and got the output I wanted.
My question is: why did the first approach not work in this case?
Answer 0 (score: 0)
from bs4 import BeautifulSoup
htmlFile = """<html>
<head>
<title>The Website Title</title>
</head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body>
</html>"""
soup = BeautifulSoup(htmlFile, 'html.parser')
print(soup.select("#author"))
I got the expected output:
[<span id="author">Al Sweigart</span>]
Maybe you are using an old version of the module.
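To rule out an outdated install, it may help to print the installed version and rerun the selector on a tiny document. This is only a sketch: the shortened `html` string and the variable names are made up for illustration, and it assumes the built-in 'html.parser' is used, as above.

```python
import bs4
from bs4 import BeautifulSoup

# Show which Beautiful Soup release is actually being imported.
print(bs4.__version__)

# Shortened, illustrative version of the page from the question.
html = '<p>By <span id="author">Al Sweigart</span></p>'
soup = BeautifulSoup(html, 'html.parser')

# If this prints an empty list, the installed version is worth a closer look.
elems = soup.select('#author')
print(elems)
```

If the list is empty here, upgrading the package (for example with pip) would be the first thing to try.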
Answer 1 (score: 0)
I think the Python version is causing the problem.
I am using Python 3.6.2 and BeautifulSoup 4.6.0.
Here is my approach:
from bs4 import BeautifulSoup
content = '<html><head><title>The Website Title</title></head><body><p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p><p class="slogan">Learn Python the easy way!</p><p>By <span id="author">Al Sweigart</span></p></body></html>'
soup = BeautifulSoup(content, 'html.parser')
result1 = soup.select("[id='author']")  # attribute selector
print(result1)  # output [<span id="author">Al Sweigart</span>]
result2 = soup.select('#author')  # id selector
print(result2)  # output [<span id="author">Al Sweigart</span>]
result3 = soup.select('span#author')  # tag name plus id selector
print(result3)  # output [<span id="author">Al Sweigart</span>]
result4 = soup.span  # this is how the documentation does it; returns the first <span> only
print(result4)  # output <span id="author">Al Sweigart</span>
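If the '#author' selector still misbehaves on your setup, looking the element up by its id attribute directly might be a workable alternative. A minimal sketch, assuming the same HTML as above (shortened here); the names `author_tag`, `first_match`, and `author_name` are only illustrative:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the page from the question.
html = ('<html><body><p class="slogan">Learn Python the easy way!</p>'
        '<p>By <span id="author">Al Sweigart</span></p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# find() matches on the id attribute and returns one Tag, or None if absent.
author_tag = soup.find(id='author')

# select_one() returns the first match of a CSS selector, or None.
first_match = soup.select_one('#author')

# Guard against None before reading the text.
author_name = author_tag.get_text() if author_tag is not None else None
print(author_name)
```

Unlike `soup.span`, which simply returns the first `<span>` in the document, both calls here target the element by its id, so they should keep working if more `<span>` tags are added to the page.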