Getting a website's audio source link from Python

Date: 2017-05-05 09:07:53

Tags: javascript python html asp.net web-crawler

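I am writing a script to get audio source links from a website. Scraping the main page gives me a list of available links, but when I scrape those links I cannot find the audio source. (It should be in the src attribute of the <audio> tag.)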

Here is my code:

DispatchQueue.global().asyncAfter(deadline: .now() + .seconds(1)) {
    // This code will be placed on a background queue and be executed a second later

    DispatchQueue.main.async {
        // Here you may update any GUI elements
    }
}

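The page does not seem to load completely, and fetching it with urllib.request does not return the audio source. Is there something I can use instead of urllib.request that waits for the whole page to load? I was thinking of using an external web browser to generate the HTML, but I don't know how to do that.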

1 Answer:

Answer 0: (score: 3)

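This is a bit tricky, but we can approach it step by step. First, follow the iframe link to get the player's HTML. Then, find the Flash player link and follow it. Finally, extract the link to the mp3 and download the stream. All of this happens within the same web-scraping session: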

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


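def download_file(session, link, path):
    """Stream the response body for `link` into the file at `path`."""
    r = session.get(link, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            # Read in 8 KiB chunks; iterating the response directly yields 128-byte chunks
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)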


base_url = "http://www.e-radio.gr"
url = "http://www.e-radio.gr/Rainbow-89-Thessaloniki-i92/live"

with requests.Session() as session:
    # Send a browser-like User-Agent while keeping the session's other default headers
    session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'})
    response = session.get(url)

    # Step 1: find the player iframe on the main page and resolve its URL
    soup = BeautifulSoup(response.content, "html.parser")
    frame = soup.find(id="playerControls1")
    frame_url = urljoin(base_url, frame["src"])

    # Step 2: load the player HTML and follow the link to the Flash player
    response = session.get(frame_url)
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.select_one(".onerror a")['href']
    flash_url = urljoin(response.url, link)

    # Step 3: take everything after "url=" in the flashvars parameter as the mp3 link
    response = session.get(flash_url)
    soup = BeautifulSoup(response.content, "html.parser")
    mp3_link = soup.select_one("param[name=flashvars]")['value'].split("url=", 1)[-1]
    print(mp3_link)

    download_file(session, mp3_link, "download.mp3")
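
One caveat about the download step: if the mp3 link points to a live radio stream rather than a finite file, the write loop may never finish on its own. The sketch below shows one possible way to cap the download at a fixed number of bytes. The helper name `save_chunks` and its `max_bytes` parameter are hypothetical and not part of the answer above. It takes any iterable of byte chunks, so it can be tried offline with fake data.

```python
import io
from typing import BinaryIO, Iterable


def save_chunks(chunks: Iterable[bytes], out: BinaryIO, max_bytes: int) -> int:
    """Write chunks to `out` until `max_bytes` bytes are written; return the count."""
    if max_bytes <= 0:
        return 0
    written = 0
    for chunk in chunks:
        # Trim the last chunk so the total never exceeds the limit
        piece = chunk[: max_bytes - written]
        out.write(piece)
        written += len(piece)
        if written >= max_bytes:
            # Stop before asking the (possibly endless) stream for more data
            break
    return written


if __name__ == "__main__":
    # Offline demonstration: a list of byte strings stands in for a network stream
    fake_stream = [b"abcd", b"efgh", b"ijkl"]
    buffer = io.BytesIO()
    print(save_chunks(fake_stream, buffer, max_bytes=10))  # 10
    print(buffer.getvalue())  # b'abcdefghij'
```

With requests, the same helper could be fed from `r.iter_content(chunk_size=8192)` on a response opened with `stream=True`, writing to a file opened in 'wb' mode. The right limit depends on how much audio you want to keep.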