I have a website:
http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1
I want to collect the title and price of every ad into a list. This is what I have so far:
import urllib2
from BeautifulSoup import BeautifulSoup
import re
listofads = []
page = urllib2.urlopen("http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1").read()
soup = BeautifulSoup(page)
for a in soup.findAll("div", {"class":re.compile("lista")}):
    for i in a:
        c = soup.findAll('h2')
        y = soup.findAll("span", {"class":re.compile("right")})
        listofads.append(c)
        listofads.append(y)
print listofads
What I get looks like this:
</h2>, <h2> Procura: Macbook Pro i7, 15' </h2>], [<span class="right">50 €</span>
That looks terrible. What I would like to get is:
Macbook bla bla . price = 500
Macbook B . price = 600
and so on.
The site's HTML looks like this:
<div class="listofads">
<div class="lista " style="cursor: pointer;">
<div class="lista " style="cursor: pointer;">
<div class="li_image">
<div class="li_desc">
<a href="http://www.custojusto.pt/Lisboa/Laptops/Macbook+pro+15-11018054.htm?xtcr=2&" name="11018054">
<h2> Macbook pro 15 </h2>
</a>
<div class="clear"></div>
<span class="li_date largedate listline"> Informática & Acessórios - Loures </span>
<span class="li_date largedate listline">
</div>
<div class="li_categoria">
<span class="li_price">
<ul>
<li>
<span class="right">1 199 €</span>
<div class="clear"></div>
</li>
<li class="excep"> </li>
</ul>
</span>
</div>
<div class="clear"></div>
</div>
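As you can see, I only want the H2 text inside the div with class "li_desc", plus the price from the price span (the one with class "right" in the HTML above).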
Answer 0 (score: 0)
I don't know how to do this with BeautifulSoup, since it doesn't support XPath, but here is how you can do it nicely with lxml:
import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector
url = "http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
my_products = []
# Here, we harvest all the results into a list of dictionaries, containing the items we want.
for product_result in CSSSelector(u'div.lista')(tree):
    # Now, we can select the child elements of each div.lista.
    this_product = {
        u'name': product_result.xpath('div[2]/a/h2'), # first h2 of the second child div
        u'category': product_result.xpath('div[2]/span[1]'), # first span of the second child div
        u'price': product_result.xpath('div[3]/span/ul/li[1]/span'), # Third div, span, ul, first li, span tag.
    }
    print this_product.get(u'name')[0].text
    my_products.append(this_product)
# Let's inspect a product result now:
for product in my_products:
    print u'Product Name: "{0}", costs: "{1}"'.format(
        product.get(u'name')[0].text.replace(u'Procura:', u'').strip() if product.get(u'name') else 'NONAME!',
        product.get(u'price')[0].text.strip() if product.get(u'price') else u'NO PRICE!',
    )
And here is some of the output:
Product Name: "Macbook Pro", costs: "890 €"
Product Name: "Memoria para Macbook Pro", costs: "50 €"
Product Name: "Macbook pro 15", costs: "1 199 €"
Product Name: "Macbook Air 13", costs: "1 450 €"
Some items have no price, so each result needs to be checked before it is printed.
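To make that check easier to see in isolation, here is a minimal offline sketch in Python 3 syntax. It is not the code above. It swaps lxml for the standard library's `xml.etree.ElementTree` so it runs without network access or extra packages, and it parses a small, well-formed sample that I made up to mirror the structure shown in the question. The names `SAMPLE`, `first_text` and `extract_products` are hypothetical helpers for illustration only. Real pages are rarely well-formed XML, so for the live site lxml's `HTMLParser` is probably still the better tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample mirroring the listing structure from the question.
# The second ad deliberately has no price span.
SAMPLE = """<div class="listofads">
  <div class="lista">
    <div class="li_image"/>
    <div class="li_desc">
      <a href="#"><h2> Macbook pro 15 </h2></a>
      <span class="li_date"> Loures </span>
    </div>
    <div class="li_categoria">
      <span class="li_price"><ul><li><span class="right">1 199 €</span></li></ul></span>
    </div>
  </div>
  <div class="lista">
    <div class="li_image"/>
    <div class="li_desc">
      <a href="#"><h2> Procura: Macbook Pro i7 </h2></a>
    </div>
    <div class="li_categoria">
      <span class="li_price"><ul><li class="excep"> </li></ul></span>
    </div>
  </div>
</div>"""


def first_text(element, path, default):
    """Return the stripped text of the first match for path, or default if there is none."""
    node = element.find(path)
    if node is None or node.text is None:
        return default
    return node.text.strip()


def extract_products(root):
    """Collect one name/price dictionary per ad, searching relative to each ad's own div."""
    products = []
    for item in root.findall("div[@class='lista']"):
        name = first_text(item, "div[2]/a/h2", "NONAME!")
        name = name.replace("Procura:", "").strip()
        price = first_text(item, "div[3]/span/ul/li[1]/span", "NO PRICE!")
        products.append({"name": name, "price": price})
    return products


root = ET.fromstring(SAMPLE)
products = extract_products(root)
for product in products:
    print('Product Name: "{0}", costs: "{1}"'.format(product["name"], product["price"]))
```

On this sample the first ad prints with its price and the second falls back to "NO PRICE!". The point of the design is that every lookup is made relative to the current ad's element and goes through one helper that supplies a default, so an ad with a missing field cannot raise an error or shift the values of the ads after it.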