How can I scrape the video src URL from a video tag that is injected via JavaScript?

Time: 2015-09-18 12:27:09

Tags: python web web-scraping

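I'm trying to scrape the site http://www.popsci.com/thorium-dream for learning purposes.

I tried scraping it to get the video src, but couldn't, because the video tag is injected via JavaScript.

Looking at the XHR requests in the browser's Network tab, I can see the request for the video's media file: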

General
Remote Address:68.232.45.253:80
Request URL:http://video.net2.tv/PORTICO/TECH/POPSCI/POP_84/POP_20140718_84_Thorium_A/POP_20140718_84_Thorium_A_1200.mp4
Request Method:GET
Status Code:206 Partial Content (from cache)
Response Headers
Accept-Ranges:bytes
Cache-Control:max-age=604800
Content-Length:24833827
Content-Range:bytes 0-24833826/24833827
Content-Type:video/mp4
Date:Mon, 14 Sep 2015 02:54:29 GMT
Etag:"734657553"
Expires:Mon, 21 Sep 2015 02:54:29 GMT
Last-Modified:Fri, 18 Jul 2014 21:56:46 GMT
Server:ECAcc (cpm/F8B9)
X-Cache:HIT
Request Headers
Provisional headers are shown
Accept-Encoding:identity;q=1, *;q=0
Range:bytes=0-
Referer:http://player.net2.tv/?episode=53c9973ae7dbcc820502c81c&restart=true&snipe=true
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36

How can I get this URL when scraping? If possible, please give a solution that uses only Python's default libraries.

1 Answer:

Answer 0: (score: 1)

I wrote something for you. It extracts all the videos from a POPSCI episode page:

import re
import requests
from lxml import html

def getVideosLinks(content):
    # Collect every .mp3/.mp4 URL that appears in the playlist response.
    videos = re.findall(r'(http://[\.\w/_]+\.mp[34])', content)
    return videos

def prepareJSONurl(episode_hash):
    # Build the playlist URL for the given episode hash.
    json_url = "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)
    return json_url

def extractEpisodeHash(content):
    # The player page redirects through a <meta http-equiv="refresh"> tag
    # whose target URL carries the episode hash.
    tree = html.fromstring(content)
    video_url = tree.xpath('//meta[contains(@http-equiv, "refresh")]/@content')[0].split('=', 1)[1]
    episode_hash = re.findall(r'episode=(\w+)', video_url)
    return episode_hash[0]

def extractIframeURL(content):
    # Return (is_video, iframe_url); is_video is False when the page has no iframe.
    iframe_url = None
    tree = html.fromstring(content)
    try:
        iframe_url = tree.xpath('//iframe/@src')[0]
        is_video = True
    except IndexError:
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

response = requests.get(POPSCI_URL)
is_video, iframe_url = extractIframeURL(response.content)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.content)

    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)

    # .text is a str, which the str pattern in getVideosLinks needs on Python 3.
    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")

The videos come in different qualities and sizes, so you will see several .mp4 URLs for each episode.
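If only one of those renditions is wanted, the list can be narrowed down afterwards. The sketch below assumes that the number at the end of the file name (such as `_1200` in the request URL from the question) reflects the bitrate; that is a guess from the naming pattern, not something the site documents. `quality_of` and `pick_highest_quality` are hypothetical helper names, not part of the code above.

```python
import re

def quality_of(url):
    """Return the trailing number in names like ..._1200.mp4, or -1 if there is none.

    Assumption: the number reflects the bitrate; this is not documented anywhere.
    """
    match = re.search(r'_(\d+)\.mp[34]$', url)
    return int(match.group(1)) if match else -1

def pick_highest_quality(urls):
    """Return the URL with the largest trailing number, or None for an empty list."""
    return max(urls, key=quality_of) if urls else None
```

It could be applied to the result of `getVideosLinks`, for example `pick_highest_quality(getVideosLinks(final_response.text))`.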

This code works for any POPSCI episode page. Try changing POPSCI_URL to...

POPSCI_URL = "http://www.popsci.com/maker-faire-2015"

...and it will still work.

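Added:

Even though parsing HTML with regular expressions (regexp) is not recommended, I've created a regexp-only version for you, as requested. It works, but the regular expressions could be improved: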

import re
import requests

def getVideosLinks(content):
    # Collect every .mp3/.mp4 URL that appears in the playlist response.
    videos = re.findall(r'(http://[\.\w/_]+\.mp[34])', content)
    return videos

def prepareJSONurl(episode_hash):
    # Build the playlist URL for the given episode hash.
    json_url = "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)
    return json_url

def extractEpisodeHash(content):
    # Pull the episode hash out of the meta-refresh redirect.
    # The "/?" tolerates an optional slash before the query string.
    episode_hash = re.findall(r'<meta http-equiv="refresh" content="0; url=http://player\.net2\.tv/?\?episode=(\w+)&restart', content)[0]
    return episode_hash

def extractIframeURL(content):
    # Return (is_video, iframe_url); is_video is False when no iframe matches.
    iframe_url = None
    try:
        # Non-greedy match so the capture stops at the first closing quote.
        iframe_url = re.findall(r'<iframe src="(.*?)" style', content)[0]
        is_video = True
    except IndexError:
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

response = requests.get(POPSCI_URL)
# .text is a str, which the str patterns above need on Python 3.
is_video, iframe_url = extractIframeURL(response.text)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.text)

    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)

    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")
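One place where the pattern matching could be dropped is the last step. The name `prepareJSONurl` suggests that the playlist endpoint returns JSON; if it does, the links can be collected by walking the decoded data instead of matching against raw text. This is only a sketch: it assumes nothing about the layout of the playlist beyond the URLs being string values somewhere inside it, and `collect_media_urls` is a hypothetical helper name.

```python
import json

def collect_media_urls(node, extensions=('.mp4', '.mp3')):
    """Recursively gather string values that look like media URLs."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(collect_media_urls(value, extensions))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_media_urls(item, extensions))
    elif isinstance(node, str):
        # str.endswith accepts a tuple of suffixes.
        if node.startswith('http') and node.endswith(extensions):
            found.append(node)
    return found

# Made-up payload for illustration only; the real playlist layout is unknown.
sample = json.loads(
    '{"items": [{"files": ["http://example.com/a_600.mp4", "http://example.com/a.jpg"]}]}'
)
print(collect_media_urls(sample))
```

With requests, the decoded data would come from `final_response.json()`, which raises an error if the body turns out not to be JSON.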

Hope this helps.
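Since the question asked for Python's default libraries, here is a rough standard-library sketch of the same three steps (page, then iframe, then playlist), written for Python 3 with `urllib.request` and `html.parser` in place of requests and lxml. The class and function names are hypothetical, and the sketch assumes that the iframe `src` is an absolute URL and that the pages decode as UTF-8. The site may also have changed since this was written, so treat it as a starting point rather than a tested solution.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

class FrameAndRefreshParser(HTMLParser):
    """Record every <iframe src> and every <meta http-equiv="refresh"> content."""

    def __init__(self):
        super().__init__()
        self.iframe_srcs = []
        self.refresh_contents = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'iframe' and attrs.get('src'):
            self.iframe_srcs.append(attrs['src'])
        elif tag == 'meta' and (attrs.get('http-equiv') or '').lower() == 'refresh':
            self.refresh_contents.append(attrs.get('content') or '')

def parse_page(html_text):
    """Feed the HTML to the parser and return it with its collected values."""
    parser = FrameAndRefreshParser()
    parser.feed(html_text)
    return parser

def fetch(url):
    """Download a page and decode it as text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')

def find_video_links(page_url):
    """Follow page -> iframe -> playlist and return the media URLs found."""
    iframes = parse_page(fetch(page_url)).iframe_srcs
    if not iframes:
        return []  # no iframe, so probably not a video page
    refreshes = parse_page(fetch(iframes[0])).refresh_contents
    match = re.search(r'episode=(\w+)', refreshes[0]) if refreshes else None
    if not match:
        return []
    playlist = fetch('http://pepto.portico.net2.tv/playlist/' + match.group(1))
    return re.findall(r'https?://[.\w/-]+\.mp[34]', playlist)
```

Calling `find_video_links("http://www.popsci.com/thorium-dream")` would then play the role of the script above, returning a list of links instead of printing them.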