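I'm trying to parse an HTML page, but I need to filter the results before parsing it.
For example, 'http://www.ksl.com/index.php?nid=443' is a listing of car classifieds in Utah. Rather than parsing every car, I want to filter the listing first (i.e. find all the BMWs) and then parse only those pages. Is it possible to fill in a JavaScript form with Python?
Here is what I have so far: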
import urllib
content = urllib.urlopen('http://www.ksl.com/index.php?nid=443').read()
f = open('/var/www/bmw.html',"w")
f.write(content)
f.close()
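Answer 0 (score: 2)
Here is how to do it. First download the page and scrape it to find the cars you are looking for; then collect the links to the individual ad pages, which you can download and scrape in turn. No JavaScript is needed here. This example and the BeautifulSoup documentation will help you.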
from BeautifulSoup import BeautifulSoup
import urllib2
base_url = 'http://www.ksl.com'
url = base_url + '/index.php?nid=443'
model = "Honda"  # this is the name of the model to look for
# Load the page and process with BeautifulSoup
handle = urllib2.urlopen(url)
html = handle.read()
soup = BeautifulSoup(html)
# Collect all the ad detail boxes from the page
divs = soup.findAll(attrs={"class" : "detailBox"})
# For each ad, get the title
# if it contains the word "Honda", get the link
for div in divs:
    title = div.find(attrs={"class" : "adTitle"}).text
    if model in title:
        link = div.find(attrs={"class" : "listlink"})["href"]
        link = base_url + link
        # Now you have a link that you can download and scrape
        print title, link
    else:
        print "No match: ", title
At the time of answering, this snippet looks for Honda models and returns the following:
1995- Honda Prelude http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817797
No match: 1994- Ford Escort
No match: 2006- Land Rover Range Rover Sport
No match: 2006- Nissan Maxima
No match: 1957- Volvo 544
No match: 1996- Subaru Legacy
No match: 2005- Mazda Mazda6
No match: 1995- Chevrolet Monte Carlo
2002- Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817784
No match: 2004- Chevrolet Suburban (Chevrolet)
1998- Honda Civic http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817779
No match: 2004- Nissan Titan
2001- Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817770
No match: 1999- GMC Yukon
No match: 2007- Toyota Tacoma
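The code above is Python 2 and depends on the old BeautifulSoup 3 package. For anyone on Python 3, here is a rough sketch of the same filter-then-collect-links idea using only the standard library (html.parser instead of BeautifulSoup, so this is a swapped-in technique, not the approach from the answer). The class names detailBox, adTitle and listlink are taken from the answer; the sample markup, the hrefs and the names AdListParser and find_matching_ads are made up for illustration, and the real page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.ksl.com"

# Hypothetical sample markup; the real listing page may look different.
SAMPLE_HTML = """
<div class="detailBox">
  <span class="adTitle">1995- Honda Prelude</span>
  <a class="listlink" href="/index.php?ad=1001">View ad</a>
</div>
<div class="detailBox">
  <span class="adTitle">1994- Ford Escort</span>
  <a class="listlink" href="/index.php?ad=1002">View ad</a>
</div>
<div class="detailBox">
  <span class="adTitle">2002- Honda Accord</span>
  <a class="listlink" href="/index.php?ad=1003">View ad</a>
</div>
"""


class AdListParser(HTMLParser):
    """Collect the title text and link of every ad box on a listing page."""

    def __init__(self):
        super().__init__()
        self.ads = []            # one dict per ad: {"title": str, "link": str or None}
        self._title_tag = None   # tag name of the title element currently being read
        self._title_depth = 0    # nesting depth of that tag inside the title

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Track nested tags with the same name so the title ends at the right close tag.
        if self._title_tag is not None and tag == self._title_tag:
            self._title_depth += 1
        if "detailBox" in classes:
            self.ads.append({"title": "", "link": None})
        if not self.ads:
            return  # ignore anything that appears before the first ad box
        if "adTitle" in classes and self._title_tag is None:
            self._title_tag = tag
            self._title_depth = 1
        if "listlink" in classes and attrs.get("href"):
            self.ads[-1]["link"] = attrs["href"]

    def handle_endtag(self, tag):
        if tag == self._title_tag:
            self._title_depth -= 1
            if self._title_depth == 0:
                self._title_tag = None

    def handle_data(self, data):
        # Only text inside the title element is kept.
        if self._title_tag is not None:
            self.ads[-1]["title"] += data


def find_matching_ads(html, make, base_url=BASE_URL):
    """Return (title, absolute link) pairs for ads whose title contains `make`."""
    parser = AdListParser()
    parser.feed(html)
    parser.close()
    matches = []
    for ad in parser.ads:
        title = " ".join(ad["title"].split())  # normalise whitespace
        if make in title and ad["link"]:
            matches.append((title, urljoin(base_url, ad["link"])))
    return matches


results = find_matching_ads(SAMPLE_HTML, "Honda")
for title, link in results:
    print(title, link)
```

To run it against the live site you would pass in the downloaded page text instead of SAMPLE_HTML, for example the decoded result of urllib.request.urlopen(url).read(). urljoin is used here instead of plain string concatenation so that links which are already absolute are left alone.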
Answer 1 (score: -1)
If you're using Python, Beautiful Soup is what you're looking for.