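I am trying to scrape the site http://www.popsci.com/thorium-dream for learning purposes.
I tried scraping it to get the video src, but that does not work because the video tag is injected by JavaScript.
Looking at the XHR requests in the browser's Network tab, I can see the request for the video's media file: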
General
Remote Address:68.232.45.253:80
Request URL:http://video.net2.tv/PORTICO/TECH/POPSCI/POP_84/POP_20140718_84_Thorium_A/POP_20140718_84_Thorium_A_1200.mp4
Request Method:GET
Status Code:206 Partial Content (from cache)
Response Headers
Accept-Ranges:bytes
Cache-Control:max-age=604800
Content-Length:24833827
Content-Range:bytes 0-24833826/24833827
Content-Type:video/mp4
Date:Mon, 14 Sep 2015 02:54:29 GMT
Etag:"734657553"
Expires:Mon, 21 Sep 2015 02:54:29 GMT
Last-Modified:Fri, 18 Jul 2014 21:56:46 GMT
Server:ECAcc (cpm/F8B9)
X-Cache:HIT
Request Headers
Provisional headers are shown
Accept-Encoding:identity;q=1, *;q=0
Range:bytes=0-
Referer:http://player.net2.tv/?episode=53c9973ae7dbcc820502c81c&restart=true&snipe=true
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36
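How can I get this URL by scraping? If possible, please give a solution that uses only Python's default libraries.
Answer 0 (score: 1)
I wrote something for you. It extracts all the videos from a POPSCI episode page: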
import re
import requests
from lxml import html


def getVideosLinks(content):
    # Collect every .mp3/.mp4 URL found in the playlist response.
    return re.findall(r'(http://[.\w/_]+\.mp[34])', content)


def prepareJSONurl(episode_hash):
    # Build the playlist (JSON) URL for the given episode hash.
    return "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)


def extractEpisodeHash(content):
    # The player page redirects through a <meta http-equiv="refresh"> tag
    # whose target URL carries the episode hash.
    tree = html.fromstring(content)
    video_url = tree.xpath('//meta[contains(@http-equiv, "refresh")]/@content')[0].split('=', 1)[1]
    episode_hash = re.findall(r'episode=(\w+)', video_url)
    return episode_hash[0]


def extractIframeURL(content):
    # Return (is_video, iframe_url); iframe_url is None when the page has no iframe.
    tree = html.fromstring(content)
    try:
        iframe_url = tree.xpath('//iframe/@src')[0]
        is_video = True
    except IndexError:
        iframe_url = None
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

response = requests.get(POPSCI_URL)
is_video, iframe_url = extractIframeURL(response.content)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.content)
    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)
    # .text is a str, which the str pattern in getVideosLinks needs on Python 3.
    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")
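Each episode is available in several video qualities and sizes, so you will see multiple .mp4 URLs per episode.
This code works for any POPSCI episode page. Try changing POPSCI_URL to...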
POPSCI_URL = "http://www.popsci.com/maker-faire-2015"
...and it will still work.
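Because each episode yields several .mp4 URLs, it can be handy to pick one automatically. Below is a minimal sketch, assuming the number just before the extension (the 1200 in POP_20140718_84_Thorium_A_1200.mp4 from the question) reflects the bitrate or quality. That reading is only a guess from the file name, and quality_of and pick_highest_quality are hypothetical helpers, not part of the code above:

```python
import re


def quality_of(url):
    """Return the trailing number in names like '..._1200.mp4', or -1 if absent."""
    match = re.search(r'_(\d+)\.mp[34]$', url)
    return int(match.group(1)) if match else -1


def pick_highest_quality(urls):
    """Return the URL with the largest trailing number, or None for an empty list."""
    return max(urls, key=quality_of) if urls else None
```

Passing the list returned by getVideosLinks to pick_highest_quality would then give a single URL to download. If the trailing number turns out to mean something else, only quality_of needs to change.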
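Although parsing HTML with regular expressions (regexp) is not recommended, I have also written a regex-only version for you (as requested). It works, but the regular expressions could be improved: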
import re
import requests


def getVideosLinks(content):
    # Collect every .mp3/.mp4 URL found in the playlist response.
    return re.findall(r'(http://[.\w/_]+\.mp[34])', content)


def prepareJSONurl(episode_hash):
    # Build the playlist (JSON) URL for the given episode hash.
    return "http://pepto.portico.net2.tv/playlist/{hash}".format(hash=episode_hash)


def extractEpisodeHash(content):
    # Pull the episode hash out of the player page's meta-refresh redirect.
    pattern = r'<meta http-equiv="refresh" content="0; url=http://player\.net2\.tv\?episode=(\w+)&restart'
    return re.findall(pattern, content)[0]


def extractIframeURL(content):
    # Return (is_video, iframe_url); iframe_url is None when no iframe matches.
    try:
        # [^"]* stops at the closing quote instead of matching greedily.
        iframe_url = re.findall(r'<iframe src="([^"]*)" style', content)[0]
        is_video = True
    except IndexError:
        iframe_url = None
        is_video = False
    return is_video, iframe_url


POPSCI_URL = "http://www.popsci.com/thorium-dream"

# .text yields str, which these str patterns require on Python 3.
response = requests.get(POPSCI_URL)
is_video, iframe_url = extractIframeURL(response.text)

if is_video:
    response_from_iframe_url = requests.get(iframe_url)
    episode_hash = extractEpisodeHash(response_from_iframe_url.text)
    json_url = prepareJSONurl(episode_hash)
    final_response = requests.get(json_url)
    for video in getVideosLinks(final_response.text):
        print("Video: {}".format(video))
else:
    print("This is not a POPSCI video page :|")
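Since the question asks for default Python libraries only, the same flow could also be sketched without requests or lxml. This is a rough Python 3 sketch rather than a tested replacement: it assumes the page structure described above (an iframe whose page carries a meta refresh pointing at player.net2.tv with an episode query parameter), it assumes the responses decode as UTF-8, and the names PageParser, parse_page, episode_from_url, video_links, fetch and find_videos are made up for illustration:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen

# Playlist URL template taken from prepareJSONurl above.
PLAYLIST_URL = "http://pepto.portico.net2.tv/playlist/{hash}"


class PageParser(HTMLParser):
    """Collect iframe src values and meta-refresh target URLs."""

    def __init__(self):
        super().__init__()
        self.iframe_sources = []
        self.refresh_targets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe" and attrs.get("src"):
            self.iframe_sources.append(attrs["src"])
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # content looks like "0; url=http://..."; keep what follows the first "=".
            _, _, target = (attrs.get("content") or "").partition("=")
            if target:
                self.refresh_targets.append(target.strip())


def parse_page(page_html):
    """Run PageParser over page_html and return the populated parser."""
    parser = PageParser()
    parser.feed(page_html)
    parser.close()
    return parser


def episode_from_url(player_url):
    """Read the 'episode' query parameter from a player URL, or return None."""
    values = parse_qs(urlparse(player_url).query).get("episode")
    return values[0] if values else None


def video_links(playlist_text):
    """Find .mp3/.mp4 URLs with the same pattern as getVideosLinks."""
    return re.findall(r'(http://[.\w/_]+\.mp[34])', playlist_text)


def fetch(url):
    """Download url and decode it as text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_videos(page_url):
    """Follow page -> iframe -> meta refresh -> playlist; return the media URLs."""
    page = parse_page(fetch(page_url))
    if not page.iframe_sources:
        return []
    player = parse_page(fetch(page.iframe_sources[0]))
    if not player.refresh_targets:
        return []
    episode = episode_from_url(player.refresh_targets[0])
    if episode is None:
        return []
    return video_links(fetch(PLAYLIST_URL.format(hash=episode)))
```

Calling find_videos("http://www.popsci.com/thorium-dream") would perform the three requests in turn. Reading the episode value with parse_qs rather than a regex means the order of the query parameters no longer matters.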
Hope this helps.