Web scraping a video

Time: 2018-11-07 19:37:40

Tags: python video screen-scraping

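I'm trying to do a proof of concept by downloading an episode of the TV show Bob's Burgers from https://www.watchcartoononline.com/bobs-burgers-season-9-episode-3-tweentrepreneurs.

I can't figure out how to extract the video URL from this site. I used the Chrome and Firefox web developer tools to work out that the video sits inside an iframe, but searching for iframes with BeautifulSoup and extracting their src URLs returns links that have nothing to do with the video. Where are the references to the mp4 or flv files (which I can see in the developer tools, even though clicking on them is forbidden)?

Any insight into scraping video from web pages with BeautifulSoup and requests would be appreciated.

Here is some code in case it's needed. A lot of tutorials say to use 'a' tags, but I didn't get any 'a' tags back.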

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.watchcartoononline.com/bobs-burgers-season-9-episode-5-live-and-let-fly")
soup = BeautifulSoup(r.content,'html.parser')
links = soup.find_all('iframe')
for link in links:
    print(link['src'])

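2 Answers:

Answer 0 (score: 2)

Background

(Scroll all the way down for the answer)

A video can only be obtained easily when the site you're trying to get it from states its location explicitly in the HTML. For example, suppose you want to get an .mp4 file from a site of your choice by referencing its .mp4 URL, and we use this site: https://4anime.to/yakunara-mug-cup-mo-episode-01-1?id=45314. If we look for <video> in inspect element, there will be a src containing the .mp4 URL.

Now, if we try to get the .mp4 URL from this site like this: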

import requests
from bs4 import BeautifulSoup 


html_url = "https://4anime.to/yakunara-mug-cup-mo-episode-01-1?id=45314"
html_response = requests.get(html_url) 
soup = BeautifulSoup(html_response.text, 'html.parser') 


for mp4 in soup.find_all('video'):
    mp4 = mp4['src']

print(mp4)

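we get a KeyError: 'src'. This happens because the actual video URL is stored on the nested source element, not on the video tag itself. We can see this if we print the values in soup.find_all('video'):
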
import requests
from bs4 import BeautifulSoup 


html_url = "https://4anime.to/yakunara-mug-cup-mo-episode-01-1?id=45314"
html_response = requests.get(html_url) 
soup = BeautifulSoup(html_response.text, 'html.parser') 


for video in soup.find_all('video'):
    # - - Print each <video> element, including its nested <source> tags
    print(video)

Output:

<video class="video-js vjs-default-skin vjs-big-play-centered" controls="" data-setup="{}" height="264" id="example_video_1" poster="" preload="none" width="640">
<source src="https://mountainoservo0002.animecdn.com/Yakunara-Mug-Cup-mo/Yakunara-Mug-Cup-mo-Episode-01.1-1080p.mp4" type="video/mp4"/>
</video>

So, if we now want to download the .mp4, we use the source element and take its src:

import requests
import shutil # - - This module helps to transfer information from 1 file to another 
from bs4 import BeautifulSoup # - - We could honestly do this without soup


# - - Get the url of the site you want to scrape
html_url = "https://4anime.to/yakunara-mug-cup-mo-episode-01-1?id=45314"
html_response = requests.get(html_url) 
soup = BeautifulSoup(html_response.text, 'html.parser') 

# - - Get the .mp4 url and the filename 
for vid in soup.find_all('source'):
    url = vid['src']
    filename = vid['src'].split('/')[-1]

# - - Get the video 
response = requests.get(url, stream=True)

# - - Make sure the status is OK
if response.status_code == 200:
    # - - Have urllib3 decode any gzip/deflate content encoding while reading
    response.raw.decode_content = True

    with open(filename, 'wb') as f:
        # - - Copy what's in response.raw and transfer it into the file
        shutil.copyfileobj(response.raw, f)
 

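(You could obviously simplify this by manually copying the source's src and using it as the base URL instead of going through html_url. I just wanted to show that you have the option of referencing the .mp4, i.e. the source's src.)

To reiterate, not every site is this explicit. With this particular site we're lucky that it's manageable. Other sites you try to scrape video from may require you to go from Elements (in inspect element) to Network. There you'd have to find the fragments of the embedded link and download them all to assemble the full video. Again, that isn't always easy, but the video on the site you asked about is.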
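The KeyError above comes from assuming the src always sits on the video tag. Here is a minimal sketch that collects src from both the video tag and any source tags nested inside it. Note that it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, and the names VideoSourceParser and find_video_urls are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class VideoSourceParser(HTMLParser):
    """Collect src values from <video> tags and from <source> tags inside them."""

    def __init__(self, base_url: str = "") -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: List[str] = []
        self._video_depth = 0  # > 0 while we are inside a <video> element

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self._video_depth += 1
        if tag == "video" or (tag == "source" and self._video_depth > 0):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL, if one was given
                self.urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "video" and self._video_depth > 0:
            self._video_depth -= 1


def find_video_urls(html: str, base_url: str = "") -> List[str]:
    parser = VideoSourceParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


sample = """
<video class="video-js" controls="" id="example_video_1" preload="none">
<source src="https://mountainoservo0002.animecdn.com/Yakunara-Mug-Cup-mo/Yakunara-Mug-Cup-mo-Episode-01.1-1080p.mp4" type="video/mp4"/>
</video>
"""
for video_url in find_video_urls(sample):
    print(video_url)
```

With requests you would pass html_response.text as the first argument and html_url as base_url. An empty result means the URL is not in the static HTML at all, which is the harder case described next.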
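For the fragment case, here is a rough sketch of the final "assemble" step only. It assumes the fragments have already been downloaded in order and are of a kind that can be joined by plain byte concatenation (MPEG-TS segments are one example). That is an assumption on my part, not something the sites above are known to use, and other formats need a proper muxer. The function name join_fragments and the file names are hypothetical.

```python
import shutil
from typing import Iterable


def join_fragments(fragment_paths: Iterable[str], output_path: str) -> None:
    """Append each fragment file, in the order given, to one output file."""
    with open(output_path, "wb") as out:
        for path in fragment_paths:
            with open(path, "rb") as part:
                # Stream the fragment into the output without loading it all into memory
                shutil.copyfileobj(part, out)
```

Something like join_fragments(["seg0.ts", "seg1.ts"], "episode.ts") would then produce one file. The order of the list matters, so sort the fragment names the same way the Network tab lists them.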

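Your answer

Open inspect element, click Chromecast Player (2. Player) located above the video to see the HTML attributes, and finally click the embed, which should look like this: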

/inc/embed/embed.php?file=bobs.burgers.s09e05.flv&amp;hd=1&amp;pid=437035&amp;h=25424730eed390d0bb4634fa93a2e96c&amp;t=1618011716&amp;embed=cizgi
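The embed path above is relative and, as copied from the page source, still has its ampersands HTML-escaped as &amp;. Here is a small sketch of turning it into an absolute URL and reading its query parameters with the standard library. I'm assuming the path is relative to the episode page used in the question's code. The helper name resolve_embed is made up.

```python
from html import unescape
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumption: the embed src is relative to the episode page from the question
PAGE_URL = "https://www.watchcartoononline.com/bobs-burgers-season-9-episode-5-live-and-let-fly"
RAW_EMBED = (
    "/inc/embed/embed.php?file=bobs.burgers.s09e05.flv&amp;hd=1&amp;pid=437035"
    "&amp;h=25424730eed390d0bb4634fa93a2e96c&amp;t=1618011716&amp;embed=cizgi"
)


def resolve_embed(page_url: str, raw_src: str) -> str:
    """Decode HTML entities in the attribute value and make it absolute."""
    return urljoin(page_url, unescape(raw_src))


embed_url = resolve_embed(PAGE_URL, RAW_EMBED)
params = parse_qs(urlsplit(embed_url).query)
print(embed_url)
print(params["file"])
```

If you read the src through BeautifulSoup rather than copying it by hand, the entities should already be decoded, and unescape simply leaves the string unchanged.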

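Once that's done, click play with inspect element still open, click the video to see its attributes (or press Ctrl+F and filter for <video>), and copy the src, which should look like this: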

https://cdn.cizgifilmlerizle.com/cizgi/bobs.burgers.s09e05.mp4?st=f9OWlOq1e-2M9eUVvhZa8A&e=1618019876

Now we can download it with Python.

import requests
# - - This module helps to transfer information from 1 file to another 
import shutil

   
url = "https://cdn.cizgifilmlerizle.com/cizgi/bobs.burgers.s09e05.mp4?st=f9OWlOq1e-2M9eUVvhZa8A&e=1618019876"

response = requests.get(url, stream=True)

if response.status_code == 200:
    # - - Have urllib3 decode any gzip/deflate content encoding while reading
    response.raw.decode_content = True

    with open('bobs-burgers.mp4', 'wb') as f:
        #  - - Take the data from response.raw and transfer it to the file
        shutil.copyfileobj(response.raw, f)
    print('downloaded file')
else:
    print('Download failed')
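Two details of that URL are worth handling in code: the file name is buried before a query string, and the e= value looks like a Unix timestamp after which the link may stop working. The expiry meaning is my guess from the numbers, not something the site documents. Here is a sketch under that assumption, with hypothetical helper names.

```python
import posixpath
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, unquote, urlsplit


def filename_from_url(url: str, default: str = "video.mp4") -> str:
    """Take the last path component, ignoring the query string."""
    name = posixpath.basename(unquote(urlsplit(url).path))
    return name or default


def expiry_from_url(url: str, param: str = "e") -> Optional[datetime]:
    """Read `param` as a Unix timestamp (assumed meaning); None if absent or non-numeric."""
    values = parse_qs(urlsplit(url).query).get(param)
    if not values or not values[0].isdigit():
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


url = "https://cdn.cizgifilmlerizle.com/cizgi/bobs.burgers.s09e05.mp4?st=f9OWlOq1e-2M9eUVvhZa8A&e=1618019876"
print(filename_from_url(url))

expires = expiry_from_url(url)
if expires is not None and expires < datetime.now(timezone.utc):
    print("Link has probably expired; copy a fresh src from the page")
```

If the guess about e= is right, a 'Download failed' from the code above most likely just means the copied link is stale.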

Answer 1 (score: 0)

import requests
url = "https://disk19.cizgifilmlerizle.com/cizgi/bobs.burgers.s09e03.mp4?st=_EEVz36ktZOv7ZxlTaXZfg&e=1541637622"
def download_file(url,filename):
    # NOTE the stream=True parameter
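    r = requests.get(url, stream=True)
    # Stop here on a 4xx/5xx response instead of saving an error page as the video
    r.raise_for_status()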
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                #f.flush() commented by recommendation from J.F.Sebastian       
    return filename

download_file(url,"bobs.burgers.s09e03.mp4")

This code downloads a specific episode to your computer. The video URL is nested in a <source> tag inside the <video> tag.
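One possible refinement of the chunked write above, as a sketch rather than a drop-in replacement: write to a temporary name first and rename only when every chunk has arrived, so an interrupted download doesn't leave a truncated .mp4 behind. The helper name save_chunks and the ".part" suffix are my own choices.

```python
import os
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], filename: str) -> int:
    """Write chunks to filename + '.part', then rename. Returns bytes written."""
    total = 0
    tmp_name = filename + ".part"
    with open(tmp_name, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                total += len(chunk)
    # Only expose the final name once the whole stream was written
    os.replace(tmp_name, filename)
    return total
```

With requests this would be called as save_chunks(r.iter_content(chunk_size=1024 * 1024), "bobs.burgers.s09e03.mp4") after r = requests.get(url, stream=True) and r.raise_for_status(). A larger chunk size than 1024 bytes usually means fewer loop iterations for a file this size.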