所以我想在这个页面(nba团队)上获得所有图片。 http://www.cbssports.com/nba/draft/mock-draft
然而,我的代码提供了更多。它给了我,
<a href="/nba/teams/page/ORL"><img src="http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png" alt="Orlando Magic" width="30" height="30" border="0" /></a>
如何缩短它只给我http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png.
我的代码:
import urllib2
from BeautifulSoup import BeautifulSoup
# or if your're using BeautifulSoup4:
# from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())
rows = soup.findAll("table", attrs = {'class': 'data borderTop'})[0].tbody.findAll("tr")[2:]
for row in rows:
fields = row.findAll("td")
if len(fields) >= 3:
anchor = row.findAll("td")[1].find("a")
if anchor:
print anchor
答案 0 :(得分:1)
要保存http://www.cbssports.com/nba/draft/mock-draft上的所有图片,
import urllib2
import os
from BeautifulSoup import BeautifulSoup
URL = "http://www.cbssports.com/nba/draft/mock-draft"
default_dir = os.path.join(os.path.expanduser("~"),"Pictures")
opener = urllib2.build_opener()
urllib2.install_opener(opener)
soup = BeautifulSoup(urllib2.urlopen(URL).read())
imgs = soup.findAll("img",{"alt":True, "src":True})
for img in imgs:
img_url = img["src"]
filename = os.path.join(default_dir, img_url.split("/")[-1])
img_data = opener.open(img_url)
f = open(filename,"wb")
f.write(img_data.read())
f.close()
要在http://www.cbssports.com/nba/draft/mock-draft上保存任何特定图片, 使用
soup.find("img",{"src":"image_name_from_source"})
答案 1 :(得分:1)
我知道这可能是“创伤性的”,但是对于那些自动生成的页面,你只想抓住该死的图像并且永远不会回来,一个快速n-dirty正则表达式,采用所需的模式往往是我的选择(没有美丽的汤依赖是一个很大的优势):
import urllib, re
source = urllib.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()
## every image name is an abbreviation composed by capital letters, so...
for link in re.findall('http://sports.cbsimg.net/images/nba/logos/30x30/[A-Z]*.png', source):
print link
## the code above just prints the link;
## if you want to actually download, set the flag below to True
actually_download = False
if actually_download:
filename = link.split('/')[-1]
urllib.urlretrieve(link, filename)
希望这有帮助!
答案 2 :(得分:1)
您可以使用此功能从url获取所有图像的列表。
#
#
# get_url_images_in_text()
#
# @param html - the html to extract urls of images from him.
# @param protocol - the protocol of the website, for append to urls that not start with protocol.
#
# @return list of imags url.
#
#
def get_url_images_in_text(html, protocol):
urls = []
all_urls = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
for url in all_urls:
if not url[0].startswith("http"):
urls.append(protocol + url[0])
else:
urls.append(url[0])
return urls
#
#
# get_images_from_url()
#
# @param url - the url for extract images url from him.
#
# @return list of images url.
#
#
def get_images_from_url(url):
protocol = url.split('/')[0]
resp = requests.get(url)
return get_url_images_in_text(resp.text, protocol)