Parsing HTML elements in Python while scraping

Posted: 2013-11-19 19:14:22

Tags: python web-scraping beautifulsoup

I have a website:

http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1

I want to collect the name of every ad, together with the price of each item, into an array. This is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup
import re


listofads = []

page = urllib2.urlopen("http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1").read()
soup = BeautifulSoup(page)
for a in soup.findAll("div", {"class":re.compile("lista")}):
            for i in a:
                c = soup.findAll('h2')
                y = soup.findAll("span", {"class":re.compile("right")})
                listofads.append(c)
                listofads.append(y)


print listofads

What I get looks like this:

                      </h2>, <h2>
                          Procura:  Macbook Pro i7, 15'

                      </h2>], [<span class="right">50  &euro;</span>

It looks terrible. What I want is:

Macbook bla bla . price = 500
Macbook B . price = 600

and so on.

The site's HTML looks like this:

<div class="listofads">
<div class="lista " style="cursor: pointer;">
<div class="lista " style="cursor: pointer;">
<div class="li_image">
<div class="li_desc">
<a href="http://www.custojusto.pt/Lisboa/Laptops/Macbook+pro+15-11018054.htm?xtcr=2&" name="11018054">
<h2> Macbook pro 15 </h2>
</a>
<div class="clear"></div>
<span class="li_date largedate listline"> Informática & Acessórios - Loures </span>
<span class="li_date largedate listline">
</div>
<div class="li_categoria">
<span class="li_price">
<ul>
<li>
<span class="right">1 199 €</span>
<div class="clear"></div>
</li>
<li class="excep"> </li>
</ul>
</span>
</div>
<div class="clear"></div>
</div>

As you can see, I only want the text of the H2 inside the div with class "li_desc", and the price from the span with class "right".

1 answer:

Answer 0 (score: 0)

I don't know how to do this with BeautifulSoup, since it doesn't support XPath, but here is how you can do it nicely with lxml:

import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector

url =  "http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)

my_products = []
# Here, we harvest all the results into a list of dictionaries containing the items we want.
for product_result in CSSSelector(u'div.lista')(tree):
    # Now, we can select the child elements of each div.lista.
    # Each xpath() call returns a list, which may be empty, so nothing is indexed here.
    this_product = {
        u'name': product_result.xpath('div[2]/a/h2'),  # h2 inside the link of the second child div
        u'category': product_result.xpath('div[2]/span[1]'),  # first span of the second child div
        u'price': product_result.xpath('div[3]/span/ul/li[1]/span'),  # third div, span, ul, first li, span tag
    }
    my_products.append(this_product)

# Let's inspect a product result now:
for product in my_products:
    print u'Product Name: "{0}", costs: "{1}"'.format(
        product.get(u'name')[0].text.replace(u'Procura:', u'').strip() if product.get(u'name') else 'NONAME!',
        product.get(u'price')[0].text.strip() if product.get(u'price') else u'NO PRICE!',
    )

And here is some of the output:

Product Name: "Macbook Pro", costs: "890  €"
Product Name: "Memoria para Macbook Pro", costs: "50  €"
Product Name: "Macbook pro 15", costs: "1 199  €"
Product Name: "Macbook Air 13", costs: "1 450  €"
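
The prices in the output above are still strings such as "1 199  €". If you want numbers, as in the "price = 500" format from the question, one possible cleanup step is sketched below. It assumes prices are whole euros with spaces as thousands separators, which is what the sample output suggests but is not confirmed for the whole site. `parse_price` is a hypothetical helper name, and the snippet is written for Python 3.

```python
import re


def parse_price(text):
    """Return the price as an int (whole euros), or None if no digits are found.

    Assumption: no decimal part, and spaces are only thousands separators.
    """
    if text is None:
        return None
    # Drop everything that is not a digit: spaces, the euro sign, etc.
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


for raw in ["890  €", "1 199  €", "", None]:
    print(repr(raw), "->", parse_price(raw))
```

Returning None instead of raising keeps the "no price" case explicit, so the caller can decide what to print for it.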

Some items have no price, so you need to check the result before printing each item.
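
For anyone who would rather stay with BeautifulSoup: XPath isn't strictly needed here. The code in the question calls `soup.findAll(...)` inside the loop, which searches the whole page on every iteration instead of searching inside the current listing, and that is why every h2 and every price ends up in the list over and over. Below is a rough sketch of the per-listing approach, not the method from the answer above. It assumes Python 3 and the `bs4` package (the question uses Python 2 and the older `BeautifulSoup` import). `SAMPLE_HTML` is a trimmed, hand-closed copy of the markup from the question, so the real page may differ, and `extract_ads` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-closed copy of the markup quoted in the question (an assumption
# about the real page). The second listing has no price on purpose.
SAMPLE_HTML = """
<div class="listofads">
  <div class="lista ">
    <div class="li_image"></div>
    <div class="li_desc">
      <a href="#"><h2> Macbook pro 15 </h2></a>
    </div>
    <div class="li_categoria">
      <span class="li_price"><ul><li><span class="right">1 199 €</span></li></ul></span>
    </div>
  </div>
  <div class="lista ">
    <div class="li_image"></div>
    <div class="li_desc">
      <a href="#"><h2> Procura: Macbook Air 13 </h2></a>
    </div>
    <div class="li_categoria"></div>
  </div>
</div>
"""


def extract_ads(html):
    """Return a list of (name, price) tuples; price is None when it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    for listing in soup.find_all("div", class_="lista"):
        # Search inside the current listing, not the whole document.
        title_tag = listing.find("h2")
        price_tag = listing.find("span", class_="right")
        if title_tag is None:
            continue  # not a real ad block
        name = title_tag.get_text(strip=True)
        price = price_tag.get_text(strip=True) if price_tag else None
        ads.append((name, price))
    return ads


for name, price in extract_ads(SAMPLE_HTML):
    print("{0} . price = {1}".format(name, price if price else "NO PRICE"))
```

Matching on `class_="lista"` compares against individual class names, so the outer `listofads` div is not picked up the way it is with `re.compile("lista")`.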