Get the value of an attribute with BeautifulSoup

Time: 2013-09-11 05:03:22

Tags: python python-2.7 beautifulsoup

I am writing a Python script that parses a web page and extracts the locations of its scripts. There are two cases:

<script type="text/javascript" src="http://example.com/something.js"></script>

<script>some JS</script>

I can already get the JS in the second case, where the JS is written inside the tag.

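But is there a way to get the value of src in the first case, i.e. to extract the value of the src attribute from every script tag (for example http://example.com/something.js)?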

Here is my code:

#!/usr/bin/python

import requests 
from bs4 import BeautifulSoup

r  = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data)
for n in soup.find_all('script'):
    print n 

Output: some JS

Desired output: http://example.com/something.js

3 answers:

Answer 0 (score: 22)

This gets every src value, but only from tags that have one; <script> tags without a src attribute are skipped.

from bs4 import BeautifulSoup
import urllib2
url="http://rediff.com/"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
sources=soup.findAll('script',{"src":True})
for source in sources:
    print source['src']

I got two src values as the result:

http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js

I think this is what you want. Hope it helps.
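
The same filter can be tried without fetching a live page. The following is a minimal sketch, assuming Python 3 (the code in this thread is Python 2.7) and the parser bundled with Python, "html.parser"; the inline HTML is the two-case example from the question, and the keyword form src=True is used in place of the {"src": True} dictionary.

```python
from bs4 import BeautifulSoup

# Inline test HTML covering both cases from the question (no network access).
html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    '<script>some JS</script>'
)

# Assumption: the standard-library parser; any installed parser should behave
# the same for markup this simple.
soup = BeautifulSoup(html, "html.parser")

# src=True matches only <script> tags that carry a src attribute,
# so the inline <script>some JS</script> tag is skipped.
sources = [tag["src"] for tag in soup.find_all("script", src=True)]

print(sources)  # ['http://example.com/something.js']
```

Because the filter already guarantees the attribute exists, indexing with tag["src"] should not raise KeyError here.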

Answer 1 (score: 5)

Get 'src' from the script node.

import requests 
from bs4 import BeautifulSoup

r  = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data)
for n in soup.find_all('script'):
    print "src:", n.get('src')  # <==== 
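
Note that get() returns None for a tag that has no src attribute, so the loop above also prints "src: None" for inline scripts. A minimal sketch of that behaviour, assuming Python 3 and the bundled "html.parser" (the thread itself uses Python 2.7), with the question's two-case HTML as input:

```python
from bs4 import BeautifulSoup

# Inline test HTML: one external script and one inline script.
html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    '<script>some JS</script>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None when the attribute is missing instead of raising KeyError.
all_values = [tag.get("src") for tag in soup.find_all("script")]

# Drop the None entries to keep only the external script URLs.
external = [value for value in all_values if value is not None]

print(all_values)  # ['http://example.com/something.js', None]
print(external)    # ['http://example.com/something.js']
```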

Answer 2 (score: 1)

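This should work: find all the script tags, then check whether each one has a 'src' attribute. If it does, the URL of the JavaScript is in the src attribute; otherwise we assume the JavaScript is inside the tag.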

#!/usr/bin/python

import requests 
from bs4 import BeautifulSoup

# Test HTML which has both cases
html = '<script type="text/javascript" src="http://example.com/something.js">'
html += '</script>  <script>some JS</script>'

soup = BeautifulSoup(html)

# Find all script tags 
for n in soup.find_all('script'):

    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']

    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = n.text

    print javascript

The output is:

http://example.com/something.js
some JS