Scraping inner frame HTML

Time: 2017-02-08 00:24:02

Tags: python beautifulsoup

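I have a Python script that scrapes the src attribute of <video> elements on HTML pages. Using the browser inspector on this page's video, I can see the video element I need to scrape, but viewing the page source directly shows only the Ember application's JavaScript files.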

What do I need to do to access the "inner frame" markup that contains the <video> element, so that I can scrape its src attribute?

Edit: narrowed the question so it is not so broad.

2 answers:

Answer 0: (score: 7)

There is no need to go the full browser/Selenium route. A little more investigation shows how the page works:

For the Vine URL https://vine.co/v/i3pQ70vK3iv, what you need is the JSON file that describes the video.

So simply fetch the URL https://archive.vine.co/posts/i3pQ70vK3iv.json. This returns a document like the following:

{
  "username": "Bleacher Report",
  "userIdStr": "906307026416705536",
  "postId": 1352573572862066700,
  "verified": 1,
  "description": "",
  "created": "2016-06-09T06:14:43.000000",
  "permalinkUrl": "https://vine.co/v/i3pQ70vK3iv",
  "userId": 906307026416705500,
  "profileBackground": "0x333333",
  "vanityUrls": [
    "BleacherReport"
  ],
  "entities": [],
  "postIdStr": "1352573572862066688",
  "comments": 293,
  "reposts": 2384,
  "videoLowURL": "http://mtc.cdn.vine.co/r/videos_r2/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=oVIxbcFKL5aaqsbMx_q.7wt4zEnhgQ0w",
  "loops": 19182516,
  "videoUrl": "http://mtc.cdn.vine.co/r/videos/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=av0W8OaLWSzghq.9__iKdSU4y75FDNg.",
  "videoDashUrl": "http://mtc.cdn.vine.co/r/videos_dashhd/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=98zVYTYAx16DJka7Oa1yQu20utGrQch9",
  "thumbnailUrl": "http://v.cdn.vine.co/r/thumbs/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4.jpg?versionId=7LmJNEI3C6bsHkF3t9jqu5k1O2xEHo9l",
  "explicitContent": 0,
  "likes": 6593
}

The URL of the video itself is in the returned JSON, as the videoUrl property.
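The approach above can be sketched in Python with only the standard library. This is a minimal sketch, not a tested client: the helper names (archive_json_url, extract_video_url, fetch_video_url) are hypothetical, and the mapping from a vine.co/v/<id> page to archive.vine.co/posts/<id>.json is assumed from the single example in this answer. The archive endpoint may also no longer respond.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen


def archive_json_url(vine_url):
    """Build the JSON URL for a Vine page URL.

    Assumption: the post id is the last path segment of the page URL,
    as in the single example https://vine.co/v/i3pQ70vK3iv.
    """
    post_id = urlparse(vine_url).path.rstrip("/").split("/")[-1]
    return "https://archive.vine.co/posts/{}.json".format(post_id)


def extract_video_url(post_json):
    """Return the videoUrl property from the JSON text describing a post."""
    return json.loads(post_json)["videoUrl"]


def fetch_video_url(vine_url):
    """Download the JSON description and return the video URL (needs network access)."""
    with urlopen(archive_json_url(vine_url)) as response:
        return extract_video_url(response.read().decode("utf-8"))
```

Calling fetch_video_url("https://vine.co/v/i3pQ70vK3iv") would then return the value of videoUrl, provided the JSON endpoint is still reachable. No HTML parsing or browser is involved, which is the point of this answer.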

Answer 1: (score: 2)

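JavaScript runs on the client to populate the page's video element, so you need a web driver that lets the page finish populating before you access the element. You can try Selenium: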

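from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://vine.co/v/i3pQ70vK3iv")
# Locate the <video> element that the page's JavaScript inserted.
video = driver.find_element(By.TAG_NAME, "video")
print(video.get_attribute("src"))
driver.quit()  # end the browser session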

If you want to run the driver "headless" (without a GUI), see "Is it possible to run selenium (Firefox) web driver without a GUI?"