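I am trying to write Python code that searches HTML for an image link. The code I need to find is - . I need to extract the http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg part regardless of what the link actually is. Is there a way to do this, or should I look into a different approach? I have access to the standard Python modules and BeautifulSoup.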
Answer 0 (score: 0)
You can try lxml (http://lxml.de/) together with XPath (http://en.wikipedia.org/wiki/XPath).
For example, to find the images in the HTML you can do:
import lxml.html
import requests

html = requests.get('http://www.google.com/').text
doc = lxml.html.document_fromstring(html)
images = doc.xpath('//img')  # select every <img> element in the document
if images:
    print(images[0].get('src'))  # src attribute of the first <img>
else:
    print("Images not found")
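Since the question mentions that the standard modules are available, the same idea can also be sketched without lxml or requests. The following is only an illustration under stated assumptions: `ImgSrcCollector` is a hypothetical helper name, and `sample_html` is made-up markup standing in for the real page, whose structure may differ.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up markup for illustration only; the real page will look different.
sample_html = """
<div>
  <a href="/p/1"><img class="thumb" src="http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg" alt="lamp"></a>
  <img alt="an image tag without a src">
</div>
"""

collector = ImgSrcCollector()
collector.feed(sample_html)
for src in collector.sources:
    print(src)
```

Because the collector looks only at the tag name and the `src` attribute, it does not depend on what the link actually contains.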
I hope this helps.
Update: I fixed the else, which was previously missing its ":".
Answer 1 (score: 0)
The Beautiful Soup documentation has a good "Quick Start" section: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`

url = "http://www.darlighting.co.uk/"
html = urlopen(url).read()
soup = Soup(html, "html.parser")  # name the parser explicitly

# find image tag with specific source
the_image_tag = soup.find("img", src='/images/dhl_logo.png')
print(type(the_image_tag), the_image_tag)
# >>> <class 'bs4.element.Tag'> <img src="/images/dhl_logo.png"/>

# find all image tags
img_tags = soup.find_all("img")
for img_tag in img_tags:
    print(img_tag.get('src'))  # .get avoids a KeyError when an <img> has no src
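A `src` value is often relative (such as '/images/dhl_logo.png'), while the question asks for the full http://... address. One possible way to bridge that gap, sketched here with the standard library's `urljoin`, is to resolve each value against the page URL. The `srcs` list is sample input assumed for the illustration, not output captured from the site.

```python
from urllib.parse import urljoin

page_url = "http://www.darlighting.co.uk/"

# Sample src values, assumed for illustration: one relative, one already absolute.
srcs = [
    "/images/dhl_logo.png",
    "http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg",
]

# urljoin resolves relative references and leaves absolute URLs unchanged.
absolute_srcs = [urljoin(page_url, src) for src in srcs]
for link in absolute_srcs:
    print(link)
```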
Answer 2 (score: 0)
import http.client  # named httplib in Python 2
from lxml import html

# CONNECTION
url = "www.darlighting.co.uk"
path = "/"
conn = http.client.HTTPConnection(url)
conn.putrequest("GET", path)

# REQUEST HEADERS
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)", "Cache-Control": "no-cache"}
for k, v in header.items():
    conn.putheader(k, v)
conn.endheaders()

res = conn.getresponse()
if res.status == 200:
    source = res.read()
else:
    # without a 200 response there is nothing to parse, so report and stop
    print(res.status)
    print(res.getheaders())
    raise SystemExit(1)

# EXTRACT
dochtml = html.fromstring(source)
for elem, att, link, pos in dochtml.iterlinks():
    if att == 'src':  # or 'href'
        print('elem: {0} || pos {1}: || attr: {2} || link: {3}'.format(elem, pos, att, link))
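Filtering on the `src` attribute alone also returns scripts and other embedded resources. A rough way to narrow the results to images is to check the file extension of each link's path. This is a sketch under assumptions: `looks_like_image` is a hypothetical helper, the extension set is an arbitrary choice, and the `links` list (including the script path and the `?v=2` query string) is made up for the example.

```python
import posixpath
from urllib.parse import urlparse

# Assumed set of extensions; extend it to match the site being scraped.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def looks_like_image(link):
    """Return True when the path part of the link ends in a known image extension."""
    path = urlparse(link).path  # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


# Made-up links for illustration, in the form iterlinks() might yield them.
links = [
    "/js/jquery.min.js",
    "/images/dhl_logo.png",
    "http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg?v=2",
]

image_links = [link for link in links if looks_like_image(link)]
for link in image_links:
    print(link)
```

An extension check is only a heuristic: images served from URLs without a recognizable extension would be missed, so keeping only `<img>` elements may be the more reliable filter.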