Question

I'm learning how to do web scraping with Python and was given the following HTML file:

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>

I opened the file and read it into the variable exampleSoup. I then wanted to scrape the author from it and was told to use

elems = exampleSoup.select('#author')

However, this returned an empty list. I then tried

elems = exampleSoup.select('span#author')

and got the output I wanted.

My question is: why didn't the first approach work in this case?

Answer 1

    from bs4 import BeautifulSoup
    htmlFile = """<html>
    <head>
    <title>The Website Title</title>
    </head>
    <body>
    <p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
    <p class="slogan">Learn Python the easy way!</p>
    <p>By <span id="author">Al Sweigart</span></p>
    </body>
    </html>"""

    soup = BeautifulSoup(htmlFile, 'html.parser')
    print(soup.select("#author"))

I got the expected output: [<span id="author">Al Sweigart</span>]. Perhaps you are using an older version of the module.
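To check which version is actually installed, something like the following may help. This is only a sketch: `installed_version` is a helper name made up for this example, and it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library. On older interpreters, `import bs4; print(bs4.__version__)` gives the same information.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "beautifulsoup4" is the distribution name on PyPI; the import name is bs4.
print("beautifulsoup4:", installed_version("beautifulsoup4"))
```

If this prints None, the package is not installed in the interpreter you are running, which is also worth ruling out.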

Answer 2

I think the Python version is causing this problem.

I am using Python 3.6.2 and bs4 4.6.0.

Here is my approach:

from bs4 import BeautifulSoup

content = '<html><head><title>The Website Title</title></head><body><p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p><p class="slogan">Learn Python the easy way!</p><p>By <span id="author">Al Sweigart</span></p></body></html>'
soup = BeautifulSoup(content, 'html.parser')

# Attribute selector: match any tag whose id attribute equals "author"
result1 = soup.select("[id='author']")
print(result1)  # output [<span id="author">Al Sweigart</span>]

# ID selector: the form that returned an empty list in the question
result2 = soup.select('#author')
print(result2)  # output [<span id="author">Al Sweigart</span>]

# Tag name plus ID selector
result3 = soup.select('span#author')
print(result3)  # output [<span id="author">Al Sweigart</span>]

# Attribute access returns the first <span> tag; this is how the documentation does it
result4 = soup.span
print(result4)  # output <span id="author">Al Sweigart</span>
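Since the goal was to get the author's name rather than the tag itself, here is a minimal sketch of pulling out the text once a selector matches. It assumes a Beautiful Soup version that provides `select_one`, and the shortened `html` string is just a stand-in for the file from the question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML file in the question.
html = '<p>By <span id="author">Al Sweigart</span></p>'
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of matches; select_one() returns the first match or None.
elem = soup.select_one("#author")
author = elem.get_text() if elem is not None else None
print(author)  # Al Sweigart

# find(id=...) looks the element up by attribute without using a CSS selector,
# which can be a useful cross-check if a selector unexpectedly matches nothing.
same = soup.find(id="author")
print(same.get_text() if same is not None else None)  # Al Sweigart
```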

Scraping id="author" with Beautiful Soup
