Question

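I'm trying to parse an HTML page, but I need to filter the results before parsing it.

For example, 'http://www.ksl.com/index.php?nid=443' is a classified listing of cars in Utah. Instead of parsing all the cars, I'd like to filter the listing first (i.e. find all BMWs) and then parse only those pages. Is it possible to fill out a JavaScript form with Python?

Here's what I have so far: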

import urllib

content = urllib.urlopen('http://www.ksl.com/index.php?nid=443').read()
f = open('/var/www/bmw.html',"w")
f.write(content)
f.close()

Answer 1

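Here is a way to do it. First download the page, scrape it to find the model you're looking for, and then collect the links to the new pages you want to scrape. No JavaScript is needed here. This sample and the BeautifulSoup documentation will help you.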

from BeautifulSoup import BeautifulSoup
import urllib2

base_url = 'http://www.ksl.com'
url = base_url + '/index.php?nid=443'
model = "Honda" # this is the name of the model to look for

# Load the page and process with BeautifulSoup
handle = urllib2.urlopen(url)
html = handle.read()
soup = BeautifulSoup(html)

# Collect all the ad detail boxes from the page
divs = soup.findAll(attrs={"class" : "detailBox"})

# For each ad, get the title
# if it contains the word "Honda", get the link
for div in divs:
    title = div.find(attrs={"class" : "adTitle"}).text
    if model in title:
        link = div.find(attrs={"class" : "listlink"})["href"]
        link = base_url + link
        # Now you have a link that you can download and scrape
        print title, link
    else:
        print "No match: ", title

At the time of answering, this snippet was looking for Honda models and returned the following:

1995-  Honda Prelude http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817797
No match:  1994-  Ford Escort
No match:  2006-  Land Rover Range Rover Sport
No match:  2006-  Nissan Maxima
No match:  1957-  Volvo 544
No match:  1996-  Subaru Legacy
No match:  2005-  Mazda Mazda6
No match:  1995-  Chevrolet Monte Carlo
2002-  Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817784
No match:  2004-  Chevrolet Suburban (Chevrolet)
1998-  Honda Civic http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817779
No match:  2004-  Nissan Titan
2001-  Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817770
No match:  1999-  GMC Yukon
No match:  2007-  Toyota Tacoma
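
The snippet above targets Python 2 and the BeautifulSoup 3 import. For readers on Python 3, here is a minimal sketch of the same filter-then-collect-links idea using only the standard library's html.parser. This is a swapped-in parser, not the answer's BeautifulSoup approach. The class names detailBox, adTitle and listlink are taken from the answer; the SAMPLE_HTML markup, the AdParser class and the matching_links helper are hypothetical, and the sketch assumes each title element contains plain text only and appears before its link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.ksl.com"

# Hypothetical markup: only the class names come from the answer above.
SAMPLE_HTML = """
<div class="detailBox">
  <span class="adTitle">1995 Honda Prelude</span>
  <a class="listlink" href="/index.php?nid=443&amp;ad=1">view</a>
</div>
<div class="detailBox">
  <span class="adTitle">1994 Ford Escort</span>
  <a class="listlink" href="/index.php?nid=443&amp;ad=2">view</a>
</div>
<div class="detailBox">
  <span class="adTitle">2002 Honda Accord</span>
  <a class="listlink" href="/index.php?nid=443&amp;ad=3">view</a>
</div>
"""


class AdParser(HTMLParser):
    """Collect (title, href) pairs, one per listlink anchor."""

    def __init__(self):
        super().__init__()
        self.ads = []              # finished (title, href) pairs
        self._title_parts = None   # a list while inside an adTitle element
        self._title = ""           # most recently completed title

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "adTitle" in classes:
            self._title_parts = []
        elif tag == "a" and "listlink" in classes:
            # Pair the link with the title seen just before it.
            self.ads.append((self._title, attrs.get("href") or ""))

    def handle_endtag(self, tag):
        # Assumes titles hold plain text, so the first end tag closes them.
        if self._title_parts is not None:
            self._title = "".join(self._title_parts).strip()
            self._title_parts = None

    def handle_data(self, data):
        if self._title_parts is not None:
            self._title_parts.append(data)


def matching_links(html, model, base_url=BASE_URL):
    """Return (title, absolute_url) for ads whose title contains model."""
    parser = AdParser()
    parser.feed(html)
    parser.close()
    return [(title, urljoin(base_url, href))
            for title, href in parser.ads if model in title]


matches = matching_links(SAMPLE_HTML, "Honda")
for title, link in matches:
    print(title, link)
```

To run this against the live listing, the html argument would come from urllib.request.urlopen(url).read() decoded to text, and each returned link could then be downloaded and scraped in turn. Whether the site still uses these class names is not something this sketch can confirm.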

Answer 2

If you're using Python, Beautiful Soup is what you're looking for.
