Parse a script with BeautifulSoup to find a specific value

Time: 2016-11-18 20:21:04

Tags: python beautifulsoup

I have this script:

var s1 = new SWFObject('/media/player/flvplayer.swf','single','400','300','7');s1.addParam('allowfullscreen','true');s1.addVariable('file','http://cdn.abc.con/video.flv');s1.addParam('menu','false');s1.addVariable('width','400');s1.addVariable('height','300');s1.write('player1474719921904');

I want to get the video URL value:

http://cdn.abc.con/video.flv

I tried this, but it doesn't find anything:

scripts = soup.find_all("script")
if scripts:
    for s in scripts:
        crawler_logger.info('s: %s' % s)
        l = s.find_all(attrs={'': re.compile(r'\.(flv|mp4)$')})

I want to be able to get all videos like this without knowing the URL in advance.

2 answers:

Answer 0: (score: 1)

BeautifulSoup does not parse JavaScript. From the script tag s, get the JavaScript source as plain text:

code = s.text

Then you can extract the URL manually with a regular expression, like this:

import re

code = """var s1 = new SWFObject('/media/player/flvplayer.swf','single','400','300','7');s1.addParam('allowfullscreen','true');s1.addVariable('file','http://cdn.abc.con/video.flv');s1.addParam('menu','false');s1.addVariable('width','400');s1.addVariable('height','300');s1.write('player1474719921904');"""
url = re.search(r"['\"](https?://.+?\.flv)['\"]", code).group(1)
print(url)   # http://cdn.abc.con/video.flv
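
To cover the question's goal of finding every such video without knowing the URL in advance, the same idea can be generalized. The following is a minimal sketch, not a definitive implementation: the helper name `extract_video_urls`, the `VIDEO_URL_RE` pattern, and the `.flv`/`.mp4` extension list are assumptions made for illustration, and it only matches quoted URLs that end directly in one of those extensions.

```python
import re

# Assumed pattern: a quoted http(s) URL ending in .flv or .mp4.
VIDEO_URL_RE = re.compile(r"['\"](https?://[^'\"]+?\.(?:flv|mp4))['\"]")


def extract_video_urls(script_texts):
    """Collect every matching video URL from an iterable of script sources."""
    urls = []
    for code in script_texts:
        # findall returns the captured group for each match, or [] if none.
        urls.extend(VIDEO_URL_RE.findall(code))
    return urls
```

With BeautifulSoup, the script sources could be passed in as `extract_video_urls(s.text for s in soup.find_all("script"))`. Unlike `re.search(...).group(1)`, this does not raise when a script contains no video URL.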

Answer 1: (score: 1)

    import re

    text = '''
    var s1 = new SWFObject('/media/player/flvplayer.swf','single','400','300','7');s1.addParam('allowfullscreen','true');s1.addVariable('file','http://cdn.abc.con/video.flv');s1.addParam('menu','false');s1.addVariable('width','400');s1.addVariable('height','300');s1.write('player1474719921904');
    var s1 = new SWFObject('/media/player/flvplayer.swf','single','400','300','7');s1.addParam('allowfullscreen','true');s1.addVariable('file','http://cdn.abc.con/video.flv');s1.addParam('menu','false');s1.addVariable('width','400');s1.addVariable('height','300');s1.write('player1474719921904');
    var s1 = new SWFObject('/media/player/flvplayer.swf','single','400','300','7');s1.addParam('allowfullscreen','true');s1.addVariable('file','http://cdn.abc.con/video.flv');s1.addParam('menu','false');s1.addVariable('width','400');s1.addVariable('height','300');s1.write('player1474719921904');
    '''
    link = re.findall(r"'(http.+?)'", text)
    print(link)

Output:

['http://cdn.abc.con/video.flv', 'http://cdn.abc.con/video.flv', 'http://cdn.abc.con/video.flv']

This regular expression finds every single-quoted string that starts with http and returns all of them in a list.
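
Because that pattern returns any quoted link, it would also pick up non-video URLs and repeats. One possible refinement is sketched below under assumptions: `video_links` and `VIDEO_EXTENSIONS` are hypothetical names, and the extension check is done on the URL path so that query strings are tolerated.

```python
import re
from urllib.parse import urlparse

# Assumed set of video extensions; extend as needed.
VIDEO_EXTENSIONS = (".flv", ".mp4")


def video_links(text):
    """Return unique quoted http(s) URLs whose path ends in a video extension."""
    candidates = re.findall(r"['\"](https?://[^'\"]+)['\"]", text)
    unique = []
    for link in candidates:
        # Compare only the path part, ignoring any ?query or #fragment.
        path = urlparse(link).path.lower()
        if path.endswith(VIDEO_EXTENSIONS) and link not in unique:
            unique.append(link)
    return unique
```

Order of first appearance is preserved, so the result can be used directly as a download queue.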