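Downloading *.mp4 files with Python

Time: 2013-12-21 21:05:02

Tags: python beautifulsoup urllib2

I am trying to download and save lecture videos from a website. The files download successfully, but they will not play in my media player. This is the code I am using: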

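from bs4 import BeautifulSoup
import re
import urllib2

# Parse a locally saved copy of the course page source
snippet = open('Python/SNA Page Source Revised.txt', 'r')
soup = BeautifulSoup(snippet)

# Collect the href of every anchor tag (None when an <a> has no href)
links = [link.get('href') for link in soup.find_all('a')]

videos = []

for link in links:
  # Skip anchors without an href, then keep links that mention "mp4"
  if link and re.search('mp4', link):
    videos.append(link)

vidNum = 1

for video in videos:
  f = urllib2.urlopen(video)
  # str() is required: concatenating a str with an int raises TypeError
  with open('Data Analysis/Social Network Analysis/Video ' + str(vidNum) + '.mp4', 'wb') as code:
    code.write(f.read())
  vidNum += 1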

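Everything seems to run fine, but when I try to play one of the videos I get this error: "Python (v2.7) requires to install plugins to play media files of the following type: text/html decoder". Also, if I download a video manually from the website, the file is about 22.8 MB, but when I use my script, the file is only 7.8 kB.

Is there something wrong with the way I am downloading the files? Any help would be appreciated.

Also: I am running Python v2.7 on Ubuntu 12.04 LTS.

**** EDIT ****

Here is the code I am using now, based on the responses I received: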

import requests

r = requests.get('https://class.coursera.org/sna-003/lecture/download.mp4?lecture_id=2', auth=('myUsername', 'myPassword'))

with open('Data Analysis/TestFile.mp4', 'wb') as fd:
  fd.write(r.content)

This is the output of r.content:

<!DOCTYPE html>
<html itemtype="http://schema.org" xmlns:fb="http://ogp.me/ns/fb#"><head><meta content="IE=Edge,chrome=IE7" http-equiv="X-UA-Compatible"/><meta content="!" name="fragment"/><meta content="NOODP" name="robots"/><meta charset="utf-8"/><meta content="Coursera" property="og:title"/><meta content="website" property="og:type"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" property="og:image"/><meta content="https://www.coursera.org/" property="og:url"/><meta content="Coursera" property="og:site_name"/><meta content="en_US" property="og:locale"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." property="og:description"/><meta content="727836538,4807654" property="fb:admins"/><meta content="274998519252278" property="fb:app_id"/><meta content="Take free online classes from 80+ top universities and organizations. Coursera is a social entrepreneurship company partnering with Stanford University, Yale University, Princeton University and others around the world to offer courses online for anyone to take, for free. We believe in connecting people to a great education so that anyone around the world can learn without limits." name="description"/><meta content="http://s3.amazonaws.com/coursera/media/Coursera_Computer_Narrow.png" name="image"/><meta content="app-id=736535961" name="apple-itunes-app"/><script>window.onerror = function(message, url, lineNum) {

  // First check the URL and line number of the error
  url = url || window.location.href;
  // 99% of the time, errors without line numbers arent due to our code,
  // they are due to third party plugins and browser extensions
  if (lineNum === undefined || lineNum == null) return;

  // Now figure out the actual error message
  // If it's an event, as triggered in several browsers
  if (message.target &amp;&amp; message.type) {
    message = message.type;
  }
  if (!message.indexOf) {
    message = 'Non-string, non-event error: ' + (typeof message);
  }

  var errorDescrip = {
    message: message,
    script: url,
    line: lineNum,
    url: document.URL
  }

  var err = {
    key: 'page.error.javascript', 
    value: errorDescrip
  }

  window._204 = window._204 || [];
  window._204.push(err);

  window._gaq = window._gaq || [];
  window._gaq.push(err);
}</script><title>Coursera.org</title><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/css/home.css" rel="stylesheet" type="text/css"/><link href="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/pages/auth/css/auth.css" rel="stylesheet" type="text/css"/><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" id="_mobile">(function(el) {
  // Override certian behaviour if the page is for our mobile app.
  // TODO(priya) Remove this conditional behaviour once I want to push this behaviour
  // for regular authentication pages on mobile/smaller screens as well.
  // Currently I'm keeping existing behaviour same and only adding mobile specific
  // layouts ot /mobilesignup page (which is what isMobileApp = true signifies).
  if ("false" == "true") {
    var head = document.getElementsByTagName('head')[0];
    // Add viewport meta tag
    var viewport = document.querySelector('meta[name=viewport]');
    var viewportContent = 'width=device-width, initial-scale=1.0, user-scalable=no';
    if (!viewport) {
        viewport = document.createElement('meta');
        viewport.setAttribute('name', 'viewport');
        head.appendChild(viewport);
    }
    viewport.setAttribute('content', viewportContent);

    // Add responsive css
    var link  = document.createElement('link');
    link.rel  = 'stylesheet';
    link.type = 'text/css';
    link.href = el.getAttribute("data-baseurl") + "pages/auth/css/auth_responsive.css";
    head.appendChild(link);
  }
})(document.getElementById("_mobile"));
</script></head><body><div id="fb-root"></div><div id="origami"><div style="position:absolute;top:0px;left:0px;width:100%;height:100%;background:#f5f5f5;padding-top:5%;"><div id="coursera-loading-nojs" style="text-align:center; margin-bottom:10px;display:none;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div><div><span id="coursera-loading-js" style="display: none; padding-left:45%">loading   <img src="https://d2wvvaown1ul17.cloudfront.net/site-static/images/icons/loading.gif"/></span></div><noscript><div style="text-align:center; margin-bottom:10px;">Please use a <a href="/browsers">modern browser </a> with JavaScript enabled to use Coursera.</div></noscript></div></div><!--[if gte IE 8]&gt;&lt;script&gt;document.getElementById("coursera-loading-js").style.display = 'block';&lt;/script&gt;&lt;![endif]-->
<!--[if lte IE 7]&gt;&lt;script&gt;document.getElementById("coursera-loading-nojs").style.display = 'block';
window._204 = window._204 || [];
window._gaq = window._gaq || [];

window._gaq.push(
    ['_setAccount', 'UA-28377374-1'],
    ['_setDomainName', window.location.hostname],
    ['_setAllowLinker', true],
    ['_trackPageview', window.location.pathname]);

window._204.push(
  ['client', 'home'],
  {key:"pageview", value:window.location.pathname});
  &lt;/script&gt;&lt;script src="https://eventing.coursera.org/204.min.js"&gt;&lt;/script&gt;&lt;script src="https://ssl.google-analytics.com/ga.js"&gt;&lt;/script&gt;&lt;![endif]-->
<!--[if !IE]&gt; --><script>document.getElementById("coursera-loading-js").style.display = 'block';</script><!-- &lt;![endif]--><script src="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/js/core/require.js" type="text/javascript"></script><script data-baseurl="https://d1rlkby5e91r2j.cloudfront.net/e47434615f57601f9b9ccaf255a589e8550d328d/" data-debug="0" data-locale="" data-timestamp="1386838999742" data-version="e47434615f57601f9b9ccaf255a589e8550d328d" id="_require" type="text/javascript">if(document.getElementById("coursera-loading-js").style.display == 'block') {
  (function(el) {
     // prevent throw
     require.onError = function(err) {
       window._204 = window._204 || [];
       window._204.push({key: 'requireErr', value: err});
     };

     define("pages/auth/authConfig",
         function() {
             return {"coursera_url": "https://www.coursera.org/",
                     "environment": "production"};
     }
     );

     require.config({
       enforceDefine: false,
       waitSeconds: 14,
       baseUrl: el.getAttribute("data-baseurl"),
       urlArgs: el.getAttribute("data-debug") == "1" ? "v=" + el.getAttribute("data-timestamp") : "",
       shim: {
          "underscore": {
             exports: '_'
          },
          "backbone": {
             deps: ['underscore', 'jquery'],
             exports: 'Backbone'
          }
       },
       paths: {
          "jquery":       "js/core/jquery",
          "underscore":   "js/core/underscore",
          "backbone":     "js/core/backbone",
          "i18n":         "js/core/i18n._t"
       },
       callback: function() {
         require(["pages/auth/routes"]); // bootup coursera
       },
       config: {
         i18n: {
           locale: (window.localStorage ? localStorage.getItem("locale") : '') || el.getAttribute("data-locale")
         }
       }
     });
  })(document.getElementById("_require"));
}</script><script type="text/javascript">define("pages/home/models/user.json", [], function(){
  return null;
});
</script></body></html>

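I find this odd, because it looks like the website's page source, yet when I look at r.url I get a real URL that loads in my browser and prompts me to save or view the video. Even when I pass in this new URL, which I assume contains my cookie information, I still get the same content. I don't understand where I am going wrong.

3 answers:

Answer 0 (score: 3)

First, download and install the requests package.

Then use this code: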

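import requests

def downloadfile(name, url):
    name = name + ".mp4"
    # Pass the url variable (not the literal string 'url') and stream the body
    r = requests.get(url, stream=True)
    print("****Connected****")
    # The with block closes the file even if a write fails
    with open(name, 'wb') as f:
        print("Downloading.....")
        for chunk in r.iter_content(chunk_size=255):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    print("Done")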
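
A download loop like the one in this answer writes whatever the server returns, including a login page. As a minimal sketch of a guard, the function below refuses to save a body that the server labels as HTML. It is written for Python 3 and assumes the response object behaves like a requests.Response fetched with stream=True (it only uses `.headers` and `.iter_content`). The name `save_if_video` and the "reject text/html" rule are illustrative assumptions, not part of the original answer.

```python
def save_if_video(response, path, chunk_size=8192):
    """Write the response body to path unless the server labeled it as HTML.

    `response` is assumed to behave like a requests.Response fetched with
    stream=True. Returns True if the file was written, False otherwise.
    """
    content_type = response.headers.get("Content-Type", "")
    # A login or error page is normally served as text/html, not video/mp4
    if content_type.lower().startswith("text/html"):
        return False
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
    return True
```

It could be called as `save_if_video(requests.get(url, stream=True), "lecture.mp4")`; a False return value suggests the request was not authenticated.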

Answer 1 (score: 1)

You need to have valid cookies so that you do not end up downloading the login page.

Here is how to set a cookie with urllib2:

import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'cookiename=cookievalue'))
f = opener.open("http://example.com/")

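Alternatively, you can use cookielib to behave more like a web browser: perform the login process and obtain the proper cookies needed to download the movies.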
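
A minimal sketch of the cookie-jar approach mentioned above, written for Python 3, where cookielib is named `http.cookiejar` and urllib2 is `urllib.request`. The helper name `build_cookie_opener` is an assumption for illustration. The actual login request (form URL and field names) depends on the site and is not shown.

```python
import http.cookiejar
import urllib.request

def build_cookie_opener():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar

opener, jar = build_cookie_opener()
# Each opener.open(...) call records Set-Cookie headers in `jar`
# and sends the matching cookies back on later requests.
```

The idea is to send the login request through `opener` first, then fetch the video URL with the same `opener` so the session cookies are reused.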

Another option is Requests, which is similar to urllib2 but makes automating the login process easier.

Answer 2 (score: 1)

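I would first save the file as .html instead of .mp4, so you can be 100% sure whether it is a login page, an error page, or other miscellaneous junk. Some sites require cookies, a specific user agent (to block bots, scrapers, and automated vulnerability scanners), a referer, and so on.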
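
The same check can be done on the bytes themselves. The sketch below is a heuristic, not a full format parser: MP4 files normally start with an `ftyp` box, so bytes 4 to 8 read `ftyp`, while a login page starts with an HTML doctype or `<html` tag. The function name `classify_download` is an assumption for illustration, and the code targets Python 3.

```python
def classify_download(data):
    """Guess what the first bytes of a downloaded file contain.

    Returns "mp4", "html", or "unknown". This is a heuristic only.
    """
    head = data[:512]
    # MP4 files normally start with an 'ftyp' box: 4 size bytes, then b"ftyp"
    if head[4:8] == b"ftyp":
        return "mp4"
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "html"
    return "unknown"
```

Reading the first few hundred bytes of a saved file in binary mode and passing them to this function should report "html" for a 7.8 kB login page like the one in the question.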

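Personally, I use Tamper Data or Live HTTP Headers to check that my program sends the right requests while debugging.

If you are getting a CloudFront response, then you are probably not handling the cookies, user agent, or referer properly.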
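
As a minimal sketch of sending those headers explicitly, the helper below builds a request with a User-Agent, a Referer, and an optional Cookie header using Python 3's `urllib.request`. The function name `build_request` and every value passed to it are placeholders; the real values would be copied from what the browser sends, as seen in a header-inspection tool.

```python
import urllib.request

def build_request(url, user_agent, referer, cookie=None):
    """Build a Request carrying caller-supplied browser-like headers."""
    headers = {"User-Agent": user_agent, "Referer": referer}
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)

req = build_request(
    "http://example.com/video.mp4",   # placeholder URL
    user_agent="Mozilla/5.0",         # placeholder User-Agent string
    referer="http://example.com/",    # placeholder Referer
    cookie="cookiename=cookievalue",  # placeholder cookie
)
```

The request can then be passed to `urllib.request.urlopen(req)` or to an opener's `open` method.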

I just checked the link, and there is also a CSRF cookie {csrf_token = toNQOP7stgOREzrDcbPc}, which you will 100% need in order to view anything past the login page.