I wrote some silly code as a learning exercise, but it does not work for every site. Here is the code:
import urllib2, re
from BeautifulSoup import BeautifulSoup as Soup
class Founder:
    def Find_all_links(self, url):
        page_source = urllib2.urlopen(url)
        a = page_source.read()
        soup = Soup(a)
        a = soup.findAll(href=re.compile(r'/.a\w+'))
        return a
    def Find_shortcut_icon (self, url):
        a = self.Find_all_links(url)
        b = ''
        for i in a:
            strre=re.compile('shortcut icon', re.IGNORECASE)
            m=strre.search(str(i))
            if m:
                b = i["href"]
        return b
    def Save_icon(self, url):
        url = self.Find_shortcut_icon(url)
        print url
        host = re.search(r'[0-9a-zA-Z]{1,20}\.[a-zA-Z]{2,4}', url).group()
        opener = urllib2.build_opener()
        icon = opener.open(url).read()
        file = open(host+'.ico', "wb")
        file.write(icon)
        file.close()
        print '%s icon successfully saved' % host
c = Founder()
print c.Save_icon('http://lala.ru')
The strangest thing is that it works for these sites: http://habrahabr.ru http://5pd.ru
But it fails for most of the other sites I checked.
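One plausible contributor to the inconsistent behaviour is the href filter in Find_all_links: the pattern r'/.a\w+' only matches a slash, any one character, the letter "a", and then word characters. The sketch below (Python 3, standard library only) checks a few hrefs against that pattern. The helper name href_passes_filter and the sample hrefs are hypothetical, and it assumes the pattern is applied with search semantics.

```python
import re

# The href filter used in Find_all_links above.
HREF_PATTERN = re.compile(r'/.a\w+')

def href_passes_filter(href):
    """Return True if the question's href pattern finds a match in href."""
    return HREF_PATTERN.search(href) is not None

# Hypothetical hrefs, chosen only to illustrate the pattern.
samples = ["/favicon.ico", "/img/logo.ico", "http://example.com/images/icon.png"]
for href in samples:
    print(href, href_passes_filter(href))
```

Under these assumptions, an icon link whose path has no segment shaped like "/fav..." is filtered out before the 'shortcut icon' check ever runs.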
Answer 0 (score: 11)
You are making it far more complicated than it needs to be. Here is a simple way to do it:
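import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question
page = urllib.urlopen("http://5pd.ru/")
soup = BeautifulSoup(page)
# Look up the <link rel="shortcut icon"> tag and download the file it points to
icon_link = soup.find("link", rel="shortcut icon")
icon = urllib.urlopen(icon_link['href'])
with open("test.ico", "wb") as f:
    f.write(icon.read())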
Answer 1 (score: 1)
Thanks, kurd. Here is the code with a few changes:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.facebook.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
icon_link = soup.find("link", rel="shortcut icon")
try:
    icon = urllib2.urlopen(icon_link['href'])
except ValueError:
    # A relative href has no scheme, so urlopen rejects it; prepend the site URL
    icon = urllib2.urlopen(url + icon_link['href'])
# Build a file name such as "facebook.com.ico" from the host part of the URL
iconname = url.split(r'/')
iconname = iconname[2].split('.')
iconname = iconname[1] + '.' + iconname[2] + '.ico'
with open(iconname, "wb") as f:
    f.write(icon.read())
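As an alternative to retrying with the site URL prepended, a relative href can be resolved against the page URL up front. This is a minimal Python 3 sketch using the standard library's urljoin and urlsplit; the helper names resolve_icon_url and icon_filename are hypothetical, and the file name it builds keeps the full host ("www.facebook.com.ico"), which differs from the split-based name above.

```python
from urllib.parse import urljoin, urlsplit

def resolve_icon_url(page_url, href):
    """Return an absolute URL for href, resolved against page_url."""
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return urljoin(page_url, href)

def icon_filename(page_url):
    """Build a file name from the page URL's host, e.g. 'www.facebook.com.ico'."""
    return urlsplit(page_url).hostname + ".ico"

print(resolve_icon_url("http://www.facebook.com", "/favicon.ico"))
print(icon_filename("http://www.facebook.com"))
```

Resolving first also covers hrefs that are relative to the current directory or that start with "//", cases where simple string concatenation would produce a wrong URL.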
Answer 2 (score: 0)
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://5pd.ru/")
soup = BeautifulSoup(page.read())
icon_link = soup.find("link", rel="shortcut icon")
icon = urllib2.urlopen(icon_link['href'])
with open("test.ico", "wb") as f:
    f.write(icon.read())
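Answer 3 (score: 0)
Thomas K's answer got me started in the right direction, but I found that some sites do not use rel="shortcut icon"; 1800contacts.com, for example, just uses rel="icon". The following works in Python 3 and returns the link. You can write it to a file if you need to.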
from bs4 import BeautifulSoup
import requests
def getFavicon(domain):
    if 'http' not in domain:
        domain = 'http://' + domain
    page = requests.get(domain)
    soup = BeautifulSoup(page.text, features="lxml")
    icon_link = soup.find("link", rel="shortcut icon")
    if icon_link is None:
        icon_link = soup.find("link", rel="icon")
    if icon_link is None:
        return domain + '/favicon.ico'
    return icon_link["href"]
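Writing the returned link to a file could look roughly like the sketch below. It uses the standard library's urlopen rather than requests to stay dependency-free, and it assumes the link is already an absolute URL. The function name save_favicon and its opener parameter are hypothetical; the parameter exists only so the download step can be swapped out.

```python
from urllib.request import urlopen

def save_favicon(icon_url, path, opener=urlopen):
    """Download icon_url, write the raw bytes to path, and return the byte count."""
    with opener(icon_url) as response:
        data = response.read()
    # Icons are binary, so the file is opened in "wb" mode.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

A call such as save_favicon(getFavicon("1800contacts.com"), "favicon.ico") would then store the icon locally, provided the site returns an absolute href.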
Answer 4 (score: 0)
If anyone wants to do this with a single regular-expression check, the following works for me:
import re
from bs4 import BeautifulSoup
html_code = "<Some HTML code you get from somewhere>"
soup = BeautifulSoup(html_code, features="lxml")
for item in soup.find_all('link', attrs={'rel': re.compile("^(shortcut icon|icon)$", re.I)}):
    print(item.get('href'))
Because of the re.I flag, this also matches regardless of letter case.
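The same case-insensitive check can also be tried without BeautifulSoup. The sketch below swaps in the standard library's HTMLParser and applies the same pattern to each link tag's rel value. The names ICON_REL, IconLinkParser, and find_icon_hrefs are hypothetical, and the sample HTML is made up for illustration.

```python
import re
from html.parser import HTMLParser

# Same pattern as above: rel must be exactly "shortcut icon" or "icon", any case.
ICON_REL = re.compile(r"^(shortcut icon|icon)$", re.I)

class IconLinkParser(HTMLParser):
    """Collect href values of <link> tags whose rel value is an icon relation."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        rel = (attr_map.get("rel") or "").strip()
        href = attr_map.get("href")
        if href and ICON_REL.match(rel):
            self.hrefs.append(href)

def find_icon_hrefs(html_code):
    """Return the hrefs of all icon links found in html_code, in document order."""
    parser = IconLinkParser()
    parser.feed(html_code)
    return parser.hrefs

sample = '<link rel="Shortcut Icon" href="/a.ico"><link rel="stylesheet" href="/s.css">'
print(find_icon_hrefs(sample))
```

Anchoring the pattern with ^ and $ keeps values such as rel="apple-touch-icon" from matching.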