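I am trying to parse a website and get the URLs of 4 video files. Example link: https://cs510400.vk.me/3/u381845574/videos/e8f1419d5b.720.mp4
First I grep the HTML code and find the tag that contains my links, then I find the line within that tag that holds them.
My code: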
# coding: utf-8
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://vk.com/video-63758929_456249306')
soup = BeautifulSoup(r.content,'lxml')
scripts = soup.find_all('script')
current_tag = scripts[-1].string
links = re.findall('^.*source.*$',current_tag,re.MULTILINE)
current_line = []
for x in links:
    current_line.append(x)
print(current_line)
I got this result:
[u'ajax.preload(\'al_video.php\', {"act":"show","video":"-63758929_456249306","module":"direct"}, ["\u041d\u0435\u043c\u043d\u043e\u0433\u043e \u043f\u043e\u0442\u0430\u0441\u043a\u0443\u0445\u0430","<div id=\\"video_box_wrap-63758929_456249306\\" class=\\"video_box_wrap\\">\\n <video id=\\"video_player\\" poster=\\"https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg\\" preload=\\"none\\" controls onplaying=\\"cur.incViews && cur.incViews()\\">\\n <source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.720.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.480.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.360.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.240.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source>\\n <div class=\\"video_box_background\\" style=\\"background-image:url(https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg);\\"><\\/div>\\n <div class=\\"video_box_cant_play\\">\u0414\u0430\u043d\u043d\u043e\u0435 \u0432\u0438\u0434\u0435\u043e \u043d\u0435 \u043c\u043e\u0436\u0435\u0442 \u0431\u044b\u0442\u044c \u043f\u0440\u043e\u0438\u0433\u0440\u0430\u043d\u043e \u043d\u0430 \u044d\u0442\u043e\u043c 
\u0443\u0441\u0442\u0440\u043e\u0439\u0441\u0442\u0432\u0435<\\/div>\\n <\\/video>\\n<\\/div>","\\naddTemplates({\\"_\\":\\"_\\",\\"audio_row\\":\\"<div class=\\\\\\"audio_row _audio_row _audio_row_%1%_%0% %cls% clear_fix\\\\\\" onclick=\\\\\\"return getAudioPlayer().toggleAudio(this, event)\\\\\\" data-audio=\\\\\\"%serialized%\\\\\\" data-full-id=\\\\\\"%1%_%0%\\\\\\" id=\\\\\\"audio_%1%_%0%\\\\\\">\\\\n <div class=\\\\\\"audio_play_wrap\\\\\\" data-nodrag=\\\\\\"1\\\\\\"><button class=\\\\\\"audio_play _audio_play\\\\\\" id=\\\\\\"play_%1%_%0%\\\\\\" aria-label=\\\\\\"\\\\\\"><\\\\\\/button><\\\\\\/div>\\\\n <div class=\\\\\\"audio_info\\\\\\">\\\\n <div class=\\\\\\"audio_duration_wrap _audio_duration_wrap\\\\\\">\\\\n <div class=\\\\\\"audio_hq_label\\\\\\"><\\\\\\/div>\\\\n <div class=\\\\\\"audio_duration _audio_duration\\\\\\">%duration%<\\\\\\/div>\\\\n <div class=\\\\\\"audio_acts\\\\\\">\\\\n <div class=\\\\\\"audio_act\\\\\\" id=\\\\\\"recom\\\\\\" onmouseover=\\\\\\"audioShowActionTooltip(this, \'%1%_%0%\')\\\\\\" onclick=\\\\\\"AudioPage(this).showRecoms(this, \'%1%_%0%\', event)\\\\\\"><div><\\\\\\/div><\\\\\\/d
...
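But I only need my 4 links. What am I doing wrong? How can I get only the links out of this large tag?
Answer 0 (score: 2)
I included your result as a string and added a regex to extract the URLs.
Regex: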
(?<=src\=\\\")(https:\\\/\\\/c[\s\S]*?mp4)
Regex demo: https://regex101.com/r/GDMBqH/2
When using the regex in Python, the forward slash (/) does not need to be escaped; only the literal backslash does.
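As a small illustration of how the regex above can be written in Python, here is a minimal sketch. It assumes a shortened, made-up sample string (the example.com URL is hypothetical, not taken from the page) and keeps the lookbehind from the regex above so that only values following src=\" are matched.

```python
import re

# Hypothetical, shortened sample of the escaped markup (not the real page data).
# It is a raw string, so every backslash below is a literal character.
sample = r'<source src=\"https:\/\/cs1.example.com\/videos\/a.720.mp4?extra=x\" type=\"video\/mp4\">'

# In the pattern, `\\` matches one literal backslash.
# The forward slash and the double quote need no escaping in Python.
pattern = re.compile(r'(?<=src=\\")https:\\/\\/c[\s\S]*?mp4')

matches = pattern.findall(sample)
print(matches)
```

The matched text still contains the backslash before each slash, because that is how the URL appears inside the script tag.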
Python code:
import re
results = '''[u'ajax.preload(\'al_video.php\', {"act":"show","video":"-63758929_456249306","module":"direct"}, ["\u041d\u0435\u043c\u043d\u043e\u0433\u043e \u043f\u043e\u0442\u0430\u0441\u043a\u0443\u0445\u0430","<div id=\\"video_box_wrap-63758929_456249306\\" class=\\"video_box_wrap\\">\\n <video id=\\"video_player\\" poster=\\"https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg\\" preload=\\"none\\" controls onplaying=\\"cur.incViews && cur.incViews()\\">\\n <source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.720.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510400.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.480.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.360.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source><source src=\\"https:\\/\\/cs510603.vk.me\\/3\\/u381845574\\/videos\\/e8f1419d5b.240.mp4?extra=hX6mAywSWUELjj_xFMO6YyRMX8rqNcm193BCOJzcZxIFlwHMD5ApeTI1DW9euordOFWDVq3m4ii7OAsbPKO8y901BjBcWz7nv5U-dKOz6i69zJNmaAeQiNEezDylB3s\\" type=\\"video\\/mp4\\"><\\/source>\\n <div class=\\"video_box_background\\" style=\\"background-image:url(https:\\/\\/pp.vk.me\\/c836534\\/v836534929\\/16e40\\/DWpFw6tiZDQ.jpg);\\"><\\/div>\\n <div class=\\"video_box_cant_play\\">\u0414\u0430\u043d\u043d\u043e\u0435 \u0432\u0438\u0434\u0435\u043e \u043d\u0435 \u043c\u043e\u0436\u0435\u0442 \u0431\u044b\u0442\u044c \u043f\u0440\u043e\u0438\u0433\u0440\u0430\u043d\u043e \u043d\u0430 \u044d\u0442\u043e\u043c 
\u0443\u0441\u0442\u0440\u043e\u0439\u0441\u0442\u0432\u0435<\\/div>\\n <\\/video>\\n<\\/div>","\\naddTemplates({\\"_\\":\\"_\\",\\"audio_row\\":\\"<div class=\\\\\\"audio_row _audio_row _audio_row_%1%_%0% %cls% clear_fix\\\\\\" onclick=\\\\\\"return getAudioPlayer().toggleAudio(this, event)\\\\\\" data-audio=\\\\\\"%serialized%\\\\\\" data-full-id=\\\\\\"%1%_%0%\\\\\\" id=\\\\\\"audio_%1%_%0%\\\\\\">\\\\n <div class=\\\\\\"audio_play_wrap\\\\\\" data-nodrag=\\\\\\"1\\\\\\"><button class=\\\\\\"audio_play _audio_play\\\\\\" id=\\\\\\"play_%1%_%0%\\\\\\" aria-label=\\\\\\"\\\\\\"><\\\\\\/button><\\\\\\/div>\\\\n <div class=\\\\\\"audio_info\\\\\\">\\\\n <div class=\\\\\\"audio_duration_wrap _audio_duration_wrap\\\\\\">\\\\n <div class=\\\\\\"audio_hq_label\\\\\\"><\\\\\\/div>\\\\n <div class=\\\\\\"audio_duration _audio_duration\\\\\\">%duration%<\\\\\\/div>\\\\n <div class=\\\\\\"audio_acts\\\\\\">\\\\n <div class=\\\\\\"audio_act\\\\\\" id=\\\\\\"recom\\\\\\" onmouseover=\\\\\\"audioShowActionTooltip(this, \'%1%_%0%\')\\\\\\" onclick=\\\\\\"AudioPage(this).showRecoms(this, \'%1%_%0%\', event)\\\\\\"><div><\\\\\\/div><\\\\\\/d'''
# Print every match; each one still contains the escaped slashes (\/).
for m in re.finditer(r"(https:\\/\\/c[\s\S]*?mp4)", results):
    print(m.group(0))
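The URLs printed this way still carry a backslash before each slash. If plain links like the example in the question are wanted, one possible approach is to undo that escaping first and then match. This is only a sketch under assumptions: extract_video_urls is a hypothetical helper name, the sample string is made up (example.com hosts, not the real page data), and the pattern assumes the links start with https://c and end in .mp4 before any query string.

```python
import re


def extract_video_urls(text):
    # Hypothetical helper: turn every backslash-slash pair back into a plain slash.
    unescaped = text.replace('\\/', '/')
    # Match from https://c up to .mp4, stopping at quotes, backslashes,
    # question marks and whitespace so the ?extra=... part is left out.
    return re.findall(r'https://c[^"\\?\s]+\.mp4', unescaped)


# Made-up sample in the same escaped form as the script content.
sample = (
    r'<source src=\"https:\/\/cs1.example.com\/v\/a.720.mp4?extra=x\" type=\"video\/mp4\">'
    r'<source src=\"https:\/\/cs2.example.com\/v\/a.480.mp4?extra=x\" type=\"video\/mp4\">'
)

urls = extract_video_urls(sample)
for url in urls:
    print(url)
```

Whether the query string after .mp4 is required to actually download the file is not something this sketch checks; if it is, the pattern would need to keep that part.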