Web scraping with Python when the response is a script

Time: 2017-01-06 16:05:55

Tags: python python-2.7 web-scraping python-requests

I am trying to scrape this link, and this is the code I wrote:

import requests
from bs4 import BeautifulSoup
rlink = requests.get('http://videohost.site/play/A11QStEaNdVZfvV/')
print(rlink.content)

When I open the link in a browser, I get properly rendered HTML from which I can select tags. For example:

<video class="jw-video jw-reset" x-webkit-airplay="allow" webkit-playsinline="" playsinline="" jw-loaded="data" src="https://redirector.googlevideo.com/videoplayback?requiressl=yes&amp;id=99e7c0d36ff950d2&amp;itag=22&amp;source=webdrive&amp;ttl=transient&amp;app=explorer&amp;ip=2001:67c:2db8:7::3e0&amp;ipbits=32&amp;expire=1483730468&amp;sparams=requiressl%2Cid%2Citag%2Csource%2Cttl%2Cip%2Cipbits%2Cexpire%2Cmm%2Cmn%2Cms%2Cmv%2Cpl&amp;signature=7EFB542F7CE372D5DAD8376254F577926AF8CBEA.857A11ACEB6C65D5D075759B557CE1E114F94F03&amp;key=ck2&amp;mm=31&amp;mn=sn-bungvh5op5-vu2e&amp;ms=au&amp;mt=1483715949&amp;mv=u&amp;pl=48"></video>

But the requests module returns a script, which is what the browser executes to build the page:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta charset="UTF-8" />
      <title>Banjo HD</title>
      <meta property="og:image" content="https://lh6.googleusercontent.com/Eo6aYbkMPiltQ1HE8QXK-2RvCOB8wCgzvqiJqIYEu9DJMSodJwd24g=w1200-h630-p" />
      <link rel="stylesheet" type="text/css" href="http://videohost.site/player/jwplayer/assets/style.css">
      <script src="http://videohost.site/player/jwplayer/assets/jwplayer.js"></script> <script>jwplayer.key = "qCeaX98IpNerwNN2Vlz69NLXFAyMM5a4dyK7Pw==";</script>
   </head>
   <body>
      <div id="player"></div>
      <script type="text/javascript"> eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('1k 5=v("5");5.1l({1m:"14%",1i:"14%",1h:"1n",1q:"w",1p:17,1o:w,1r:"O://15.19/5/v/1a/v.1b.1g",1f:"16:9",1c:"17",1e:"1d",1j:"O",1w:w,1G:[{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=1F&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1I.1K&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1J","2":"1\\/4"},{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=1E&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1C.1D&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1s","2":"1\\/4"},{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=18&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1z.1A&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1B","2":"1\\/4"}],2:"1/4",1y:{3:"",1x:"",},1t:"1u 1v",1H:"O://15.19"});',62,109,'requiressl|video|type|file|mp4|player|32||expire||ipbits|3e0|ip|2001|67c|1483730468|sparams|2Cid|2Citag|2Cttl|2Cip|2Cipbits|explorer|2Csource|ttl|googlevideo|com|videoplayback|redirector|https||jwplayer|false|yes|id|2Cexpire|transient|webdrive|source|99e7c0d36ff950d2|itag|app|2db8|au|mt|ms|vu2e|bungvh5op5|1483715949|pl|http|2Cmm|label|48|sn|mv|2Cmv|2Cpl|signature|2Cms|key|ck2|2Cmn|mn|31|mm|100|videohost||true||site|assets|flash|fullscreen|html5|primary|aspectratio|swf|skin|height|provider|var|setup|width|seven|displaytitle|controls|preload|flashplayer|480P|abouttext|Video|Host|autostart|link|logo|3648867A489010D7BFA1A2E6C64F4035FDEB3814|6617735E622564ACA4793459986706DA936E58DE|360P|9FBCFB9752833B2DD83BFD6547551604AA6A340D|A55D1440195C2AF6945EE4A20DB8147CDC50F337|59|22|sources|aboutlink|7EFB542F7CE372D5DAD8376254F577926AF8CBEA|720P|857A11ACEB6C65D5D075759B557CE1E114F94F03'.split('|'),0,{})) </script><!-- Code --><script 
type="text/javascript" data-cfasync="false"> var _pop = _pop || []; _pop.push(['siteId', 1630926]); _pop.push(['minBid', 0]); _pop.push(['popundersPerIP', 0]); _pop.push(['delayBetween', 0]); _pop.push(['default', false]); _pop.push(['defaultPerDay', 0]); _pop.push(['topmostLayer', false]); (function() { var pa = document.createElement('script'); pa.type = 'text/javascript'; pa.async = true; var s = document.getElementsByTagName('script')[0]; pa.src = '//c1.popads.net/pop.js'; pa.onerror = function() { var sa = document.createElement('script'); sa.type = 'text/javascript'; sa.async = true; sa.src = '//c2.popads.net/pop.js'; s.parentNode.insertBefore(sa, s); }; s.parentNode.insertBefore(pa, s); })();</script><!-- Code End --><script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-88363984-1', 'auto'); ga('send', 'pageview');</script>
   </body>
</html>

Any pointers on how to proceed to get the final HTML would be highly appreciated.

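Any thoughts on PhantomJS? I ran the same code as suggested below, but with the PhantomJS driver, and the wait for the video tag times out. I think the script is not being executed the way it is in Firefox.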

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get('http://videohost.site/play/A11QStEaNdVZfvV/')
# driver.execute_script('')

# wait for "video" to be present
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "video")))

# get the src value
print(video.get_attribute("src"))

driver.close()

2 answers:

Answer 0 (score: 2)

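To expand on Emett's answer, here is a working example using selenium that opens Firefox (you don't have to use Firefox; several browsers are supported, including the headless PhantomJS), waits for the video element to appear, and gets its src value: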

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://videohost.site/play/A11QStEaNdVZfvV/')

# wait for "video" to be present
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "video")))

# get the src value
print(video.get_attribute("src"))

driver.close()

Answer 1 (score: 1)

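Requests and plain web scraping do not render JavaScript; requests only downloads the raw response. You need to run something like Selenium. The only problem is that it opens a browser, which can be fairly slow. To get around that, you need to use a headless browser system such as ghost.py.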
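
A browser may not be strictly required for this particular page. The response shown in the question is wrapped in the common eval(function(p,a,c,k,e,d){...}) packer, so a rough alternative is to unpack it in plain Python and read the "file" entries. The sketch below is only an illustration under assumptions: the helper names (encode, unpack, unpack_from_html, find_sources) are made up here, the regular expressions assume the call layout seen in the question, the unescaping step only handles backslash-escaped characters, and the sample page with its example.com URL is hypothetical.

```python
import re

# Digits used by the packer: 0-9, a-z, then A-Z (enough for bases up to 62).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Assumed layout of the packed call: }('payload',base,count,'w0|w1|...'.split('|')
PACKED_CALL = re.compile(
    r"\}\('(.*)',\s*(\d+),\s*(\d+),\s*'(.*?)'\.split\('\|'\)", re.DOTALL)


def encode(number, base):
    """Mirror the packer's e(c) helper: write number in the given base."""
    prefix = "" if number < base else encode(number // base, base)
    return prefix + ALPHABET[number % base]


def unpack(payload, base, count, words):
    """Replace every encoded token in payload with its word from the table."""
    table = {}
    for i in range(count):
        token = encode(i, base)
        # An empty table entry means the token stands for itself.
        table[token] = words[i] if i < len(words) and words[i] else token
    return re.sub(r"\b\w+\b",
                  lambda m: table.get(m.group(0), m.group(0)), payload)


def unpack_from_html(html):
    """Find the packed call in the page source and return the unpacked script."""
    match = PACKED_CALL.search(html)
    if match is None:
        raise ValueError("no packed script found")
    payload, base, count, words = match.groups()
    # Undo the JavaScript string-literal escaping (\\ and \') of the payload.
    payload = re.sub(r"\\(.)", r"\1", payload)
    return unpack(payload, int(base), int(count), words.split("|"))


def find_sources(script):
    """Return the values of all quoted "file" keys, with \\/ turned into /."""
    urls = re.findall(r'"file"\s*:\s*"(.*?)"', script)
    return [url.replace("\\/", "/") for url in urls]


# Hypothetical miniature page, packed the same way as the one in the question.
SAMPLE = r"""<script>eval(function(p,a,c,k,e,d){return p}('4 3=[{"1":"2:\\/\\/0.5\\/v"}];',62,6,'example|file|https|sources|var|com'.split('|'),0,{}))</script>"""

if __name__ == "__main__":
    script = unpack_from_html(SAMPLE)
    print(script)
    print(find_sources(script))
```

For the real page, the html argument would be rlink.text from the requests call in the question. The URLs in the question carry expire and ip parameters, so they appear to be short-lived and tied to the requesting address; if the site changes how it packs the script, falling back to Selenium is the safer route.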