Question

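I'm new to scraping and would appreciate some help, or just a pointer in the right direction. I've tried using Scrapy but couldn't get it to work at all. What I want to do is get the title, the episode, and the HTML5 video player links with their different qualities (480p, 720p, etc.) from this page. I'm not sure how to get the video src from the iframe element.

As mentioned, any help would be much appreciated.

Thanks.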

Answer 1

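I have no prior experience with Scrapy, but I'm in the middle of a Python web scraping project myself. I use BeautifulSoup for scraping.

I've written part of the code: it gets all the titles, episodes, and thumbnails, and loads the link to each episode page for further processing. If you run into more trouble, leave a comment ;)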

from bs4 import BeautifulSoup
from urllib import request

url = "http://getanime.to/recent"
# Send a browser-like User-Agent with every request
h = {'User-Agent': 'Mozilla/5.0'}
req = request.Request(url, headers=h)
data = request.urlopen(req)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
# print(soup.prettify()[:1000]) # For testing purposes - should print out the first 1000 characters of the HTML document

links = soup.find_all('a', class_="episode-release")
for link in links:
    # Get required info from this link
    thumbnail = link.find('div', class_="thumbnail")["style"]
    # Cut the URL out of the inline style string (fixed offsets, tied to the site's markup)
    thumbnail = thumbnail[22:-3]
    title = link.find('div', class_="title-text").contents[0].strip()
    episode = link.find('div', class_="super-block").span.contents[0]
    href = link["href"]
    # print(thumbnail, title, episode, href) # For testing purposes

    # Load the link to this episode for further processing
    req2 = request.Request(href, headers=h)
    data2 = request.urlopen(req2)
    soup2 = BeautifulSoup(data2, "html.parser")

    vid_sources = soup2.find('ul', class_="dropdown-menu dropdown-menu--top video-sources")
    # TODO repeat the above process to find all video sources
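
The fixed slice offsets used for the thumbnail above only work while the style attribute keeps exactly the same shape. A slightly more tolerant option is a regular expression, sketched below. It assumes the style contains a CSS `url(...)` value, with or without quotes; the helper name `thumbnail_url` and the sample style string are made up for illustration and do not come from the site.

```python
import re

# Matches the content of a CSS url(...) value, with or without surrounding quotes.
_URL_IN_STYLE = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""")


def thumbnail_url(style):
    """Return the URL inside url(...) in an inline style string, or None if absent."""
    match = _URL_IN_STYLE.search(style)
    return match.group(1) if match else None


# Hypothetical style string, only to show the call.
example = thumbnail_url("background-image: url('http://example.com/thumb.jpg');")
```

Returning None instead of raising lets the loop skip entries whose style attribute has no `url(...)` part.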

Edit: to clarify, the code above is for Python 3.
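
One possible way to finish the TODO is sketched below. I have not checked what the video-sources list actually contains, so this assumes each quality is an `<a>` element inside the `<ul>`, with the label (480p, 720p, ...) as its text and the link in `href`. The sample HTML and the function name `list_video_sources` are invented for the example; adjust them to whatever the real page uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may be structured differently.
SAMPLE_HTML = """
<ul class="dropdown-menu dropdown-menu--top video-sources">
  <li><a href="/watch/1?q=480">480p</a></li>
  <li><a href="/watch/1?q=720">720p</a></li>
</ul>
"""


def list_video_sources(html):
    """Return (label, href) pairs for every link inside the video-sources list."""
    soup = BeautifulSoup(html, "html.parser")
    # Matching on a single class is enough; bs4 compares against each class of the tag.
    menu = soup.find("ul", class_="video-sources")
    if menu is None:
        return []
    # href=True skips anchors that have no href attribute.
    return [(a.get_text(strip=True), a["href"]) for a in menu.find_all("a", href=True)]


sources = list_video_sources(SAMPLE_HTML)
```

In the loop above you would pass the episode page's HTML (or reuse `soup2` directly) instead of the sample string.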

Answer 2

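(Posting another answer because comments strip line breaks):

Sure, happy to help ;) You're on the right track, so keep at it. I wonder why you used find_all('iframe'), since I couldn't find any example page with more than one iframe, but I guess that works too. If you know there is only one, you can save some time by using soup.find() instead.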
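
To make the difference concrete, here is a small sketch on a made-up snippet of HTML (the iframe and its URL are placeholders, not taken from the site): `find()` gives you a single tag or None, while `find_all()` always gives you a list-like result, even for one match.

```python
from bs4 import BeautifulSoup

# Placeholder markup with exactly one iframe.
html = '<div><iframe data-src="https://example.com/embed/1"></iframe></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("iframe")        # a single Tag, or None when nothing matches
every = soup.find_all("iframe")    # a ResultSet (list subclass), possibly empty
missing = soup.find("video")       # no <video> in the snippet, so this is None

print(type(first).__name__, len(every), missing)
```

With `find()` you index the attribute directly on the tag; with `find_all()` you have to loop or take element `[0]` first.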

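Calling type(iframexx) showed me that it refers to a list (a bs4 ResultSet) containing the actual data we want. Then

# iframexx is the result of your find_all('iframe') call
for ifr in iframexx:
    print(type(ifr))
    print(ifr)
    print(ifr["data-src"])

let me get the data-src.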
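
If some iframes turn out to use a plain `src` instead of `data-src`, or the value is a relative path, something along these lines may be more forgiving. It is only a sketch: the function name `iframe_urls`, the sample HTML, and the base URL are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(html, base_url):
    """Collect absolute URLs from the data-src (or, failing that, src) of every iframe."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for iframe in soup.find_all("iframe"):
        # .get() returns None instead of raising KeyError when the attribute is missing.
        src = iframe.get("data-src") or iframe.get("src")
        if src:
            urls.append(urljoin(base_url, src))
    return urls


# Hypothetical markup: one lazy-loaded iframe, one regular iframe, one without a source.
sample = (
    '<iframe data-src="/embed/42"></iframe>'
    '<iframe src="https://player.example.com/v/7"></iframe>'
    '<iframe></iframe>'
)
found = iframe_urls(sample, "http://getanime.to/episode/1")
```

`urljoin` leaves absolute URLs untouched and resolves relative ones against the page you fetched.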

Having trouble with a Python web scraper