Python requests and BeautifulSoup4 .get('href') return absolute URLs when scraping Safaribooksonline

Time: 2018-03-13 08:21:47

Tags: python web-scraping beautifulsoup python-requests

I'm trying to scrape the contents of the <a> tags from a web page. My code is:

from bs4 import BeautifulSoup
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

url = 'https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961'

req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')

lessons = soup.find_all('li', class_='toc-level-1')
lesson = lessons[0]
print(lesson)

The page contains this element (copied directly from the DOM inspector in Firefox):

<li class="toc-level-1 t-toc-level-1 js-content-uri" data-content-uri="/api/v1/book/9780134985961/chapter/LPOC_00_00_00.html">
   <a href="/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html" class="t-chapter" tabindex="39">Introduction</a>
   <ol>
      <li class="toc-level-2 t-toc-level-2 js-content-uri" data-content-uri="/api/v1/book/9780134985961/chapter/LPOC_00_00_00.html"><a href="/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html" class="t-chapter" tabindex="41">Linux Performance Optimization: Introduction</a></li>
   </ol>
</li>

However, when I scrape the data with the requests and bs4 modules using the code above, the output I get is:

<li class="toc-level-1 t-toc-level-1">
    <a class="t-chapter js-chapter" href="https://www.safaribooksonline.comhttps://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html">Introduction</a>
    <ol>
        <li class="toc-level-2 t-toc-level-2">
            <a class="t-chapter js-chapter" href="https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html">Linux Performance Optimization: Introduction</a>
        </li>
    </ol>
</li>

Notice the href values of the <a> tags? They should be relative URLs, e.g. /library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html, but I get absolute URLs instead, and sometimes malformed ones: https://www.safaribooksonline.comhttps://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html

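I don't understand how the domain ends up prepended to the link URL, since the original HTML contains only the relative href value, unless requests or bs4 is doing this. All of my earlier scripts that use the same approach now produce similar errors. Has something changed on the module side, or am I doing something wrong?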

1 answer:

Answer 0: (score: 1)

You can use a regular expression to extract the URL from the href:

from bs4 import BeautifulSoup
import requests
import re

url = 'https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961'

req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
hrefs = set()

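for lesson in soup.find_all('li', class_='toc-level-1'):
    for a in lesson.find_all('a', href=True):
        # Split on every scheme; the capture group keeps 'http://' or 'https://' in the result.
        found_urls = re.split(r'(https?://)', a['href'])
        if len(found_urls) < 3:
            # No scheme at all (a relative href): keep the value unchanged.
            hrefs.add(a['href'])
            continue
        # The last scheme plus the text after it drops any duplicated prefix.
        hrefs.add(found_urls[-2] + found_urls[-1])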

for href in sorted(hrefs):
    print(href)

This gives you a list of the hrefs found:

https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_00_00_00.html
https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_01_00_00.html
https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_01_01_00.html
https://www.safaribooksonline.com/library/view/linux-performance-optimization/9780134985961/LPOC_01_01_01.html
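
If you would rather avoid the regular expression, the same clean-up can be sketched with string methods and urljoin from the standard library. This is only a sketch under assumptions: normalize_href and BASE_URL are hypothetical names that are not part of the code above, and it assumes that no legitimate href contains a second "http://" or "https://" later in the string (for example inside a query parameter).

```python
from urllib.parse import urljoin

# Assumed base for resolving relative links (hypothetical constant).
BASE_URL = 'https://www.safaribooksonline.com'


def normalize_href(href, base=BASE_URL):
    """Return one absolute URL for a relative, absolute, or doubled href."""
    # Position where the last scheme starts; a doubled href contains two schemes.
    last_scheme = max(href.rfind('http://'), href.rfind('https://'))
    if last_scheme > 0:
        # Drop everything before the last scheme, i.e. the duplicated prefix.
        href = href[last_scheme:]
    # urljoin leaves absolute URLs unchanged and resolves relative ones against base.
    return urljoin(base, href)


if __name__ == '__main__':
    doubled = ('https://www.safaribooksonline.com'
               'https://www.safaribooksonline.com/library/view/'
               'linux-performance-optimization/9780134985961/LPOC_00_00_00.html')
    print(normalize_href(doubled))
```

In the loop above, this would replace the re.split lines with hrefs.add(normalize_href(a['href'])), so relative, absolute, and doubled hrefs all end up in the same absolute form.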