Question

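I'm trying to write Python code that searches HTML for image links. The code I need to find is - . I need to extract the http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg part, regardless of what the link actually is. Is there a way to do this, or should I look into a different approach? I have access to the standard Python modules and BeautifulSoup.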

Answer 1

You can try lxml (http://lxml.de/) together with XPath (http://en.wikipedia.org/wiki/XPath).

For example, to find the images in the HTML you can do:

import lxml.html
import requests

html = requests.get('http://www.google.com/').text
doc = lxml.html.document_fromstring(html)
images = doc.xpath('//img')  # every <img> element; adapt the XPath to the element you need
if images:
    print(images[0].get('src'))  # src attribute of the first image
else:
    print("Images not found")
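
Building on the snippet above, here is a minimal sketch that collects every image URL at once instead of only the first. The XPath `//img/@src` returns the attribute values directly, and `make_links_absolute` rewrites relative paths against a base URL. The sample markup, `SAMPLE_HTML`, `BASE_URL`, and the helper name `image_urls` are made up for illustration; only the first URL comes from the question, and the code is written for Python 3.

```python
import lxml.html

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<html><body>
  <img src="http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg">
  <img src="/images/dhl_logo.png">
  <p>no image here</p>
</body></html>
"""
BASE_URL = "http://www.darlighting.co.uk/"  # assumed base for relative links


def image_urls(html_text, base_url):
    """Return the src of every <img> in html_text as an absolute URL."""
    doc = lxml.html.document_fromstring(html_text)
    doc.make_links_absolute(base_url)  # rewrites relative links in place
    # '@src' selects the attribute itself, so the result is a list of strings
    return [str(src) for src in doc.xpath('//img/@src')]


for url in image_urls(SAMPLE_HTML, BASE_URL):
    print(url)
```

For a live page, the text returned by `requests.get(...)` can be passed in place of `SAMPLE_HTML`.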

I hope this helps.

Update: I fixed the else, which was previously missing its ":".

Answer 2

The Beautiful Soup documentation has a good "Quick Start" section: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`

url = "http://www.darlighting.co.uk/"
html = urlopen(url).read()
soup = Soup(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning

# find image tag with specific source
the_image_tag = soup.find("img", src='/images/dhl_logo.png')
print(type(the_image_tag), the_image_tag)
# >>> <class 'bs4.element.Tag'> <img src="/images/dhl_logo.png"/>

# find all image tags
img_tags = soup.find_all("img")
for img_tag in img_tags:
    print(img_tag.get('src'))  # .get() returns None instead of raising when src is missing
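
The question asks for the product image whatever its exact URL turns out to be. One possible approach, sketched here, is to filter on a stable fragment of the path rather than the full `src`. It assumes that the `large_default` segment seen in the question's URL appears in every product image, which I have not verified against the site; the sample HTML and the helper name `find_image_urls` are invented for the example.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; only the product image URL comes from the question.
SAMPLE_HTML = """
<div>
  <img src="/images/dhl_logo.png">
  <img src="http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg">
  <img alt="no src attribute">
</div>
"""


def find_image_urls(html_text, pattern):
    """Return the src of every <img> whose src matches the regex pattern."""
    soup = BeautifulSoup(html_text, "html.parser")
    # A compiled regex used as an attribute filter is searched against each src;
    # tags without a src attribute are skipped.
    tags = soup.find_all("img", src=re.compile(pattern))
    return [tag["src"] for tag in tags]


print(find_image_urls(SAMPLE_HTML, r"large_default"))
```

Any other regular expression can be swapped in, for example `r"\.jpg$"` to keep only JPEG files.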

Answer 3

import http.client  # Python 3 name of the Python 2 `httplib` module
from lxml import html

# CONNECTION
url = "www.darlighting.co.uk"  # host name only, without the scheme
path = "/"
conn = http.client.HTTPConnection(url)
conn.putrequest("GET", path)
# Request headers
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)", "Cache-Control": "no-cache"}
for k, v in header.items():
    conn.putheader(k, v)
conn.endheaders()
res = conn.getresponse()

if res.status == 200:
    source = res.read()
else:
    # Without a body there is nothing to parse, so report the failure and stop
    print(res.status)
    print(res.getheaders())
    raise SystemExit(1)

# EXTRACT
dochtml = html.fromstring(source)
for elem, att, link, pos in dochtml.iterlinks():
    if att == 'src':  # or 'href'
        print('elem: {0} || pos {1}: || attr: {2} || link: {3}'.format(elem, pos, att, link))
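
Since the question mentions having the standard Python modules available, here is a rough sketch of the same extraction using only the standard library (Python 3). It is not part of any answer above; the class name `ImageSrcCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # urljoin leaves absolute URLs untouched and resolves relative ones
            self.sources.append(urljoin(self.base_url, src))


# Hypothetical markup; only the first URL comes from the question.
SAMPLE_HTML = """
<img src="http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg">
<img src="/images/dhl_logo.png"/>
<a href="/contact">not an image</a>
"""

collector = ImageSrcCollector("http://www.darlighting.co.uk/")
collector.feed(SAMPLE_HTML)
for src in collector.sources:
    print(src)
```

This avoids third-party dependencies, at the cost of writing the tag handling by hand and of a parser that is less forgiving with badly broken markup than lxml.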

How do I search HTML for links and print them using Python?