Question

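On the page below (link), I am trying to use BeautifulSoup to extract the text of the bottom-most <a> elements, namely 'Private Life' and 'Lost Boy'.

However, I am having trouble scraping the <iframe> content.

I understand that the browser fetches it with a separate request.

So I tried: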

iframexx = soup.find_all('iframe')
for iframe in iframexx:
    try:
        response = urllib2.urlopen(iframe)
        results = BeautifulSoup(response)
        print results

But this returns None.

How can I parse the HTML below so that I can get each a['href'].get_text()?

Answer 1

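The browser loads iframe content in a separate request, so you need to fetch the URL given in the iframe's src attribute. You can use selenium if you need to, or scrape the data directly. Here is an example: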

import re

import requests

# The URL from the iframe's src attribute
url = 'https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/310079005&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false'

response = requests.get(url)

# Each regex captures the text between `artist":"` (or `title":"`) and the next double quote
artist = re.search(b'(?<=artist":")(.*?)(?=")', response.content).group(0).decode("utf-8")
song = re.search(b'(?<=title":")(.*?)(?=")', response.content).group(0).decode("utf-8")

print("%s - %s" % (artist, song))

Private Life - Lost Boy
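The example above hardcodes the iframe URL. As a rough sketch of the first step the answer describes (finding the URL in each iframe's src), something like the following could collect those URLs from the outer page with BeautifulSoup. The helper name `iframe_sources`, the sample markup, and the `https://example.com/page` base URL are all made up for illustration; they are not from the original page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_sources(html, base_url):
    """Return the absolute URL of every <iframe> that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True skips iframes that have no src attribute at all;
    # urljoin resolves relative src values against the page URL.
    return [urljoin(base_url, tag["src"]) for tag in soup.find_all("iframe", src=True)]


# Hypothetical page markup, for illustration only
sample_html = """
<html><body>
  <iframe src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/310079005"></iframe>
  <iframe src="/embed/widget"></iframe>
  <iframe></iframe>
</body></html>
"""

sources = iframe_sources(sample_html, "https://example.com/page")
for src in sources:
    print(src)
```

Each returned string is a plain URL, so it can be passed to requests.get() in the same way as the hardcoded url above. This differs from the attempt in the question, which passed the iframe tag object itself to urlopen.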

Extracting iframe content with BeautifulSoup
