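I am using lxml to scrape a website. I want to scrape a search result that contains 194 items, but my scraper only picks up the first page of results. How can I scrape the rest of the search results?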
import requests
from lxml import html

url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
response_object = requests.get(url)
# Build DOM tree
dom_tree = html.fromstring(response_object.text)
This is followed by the scraping function:
from lxml.cssselect import CSSSelector

def enter_mmv_in_database(dom_tree, engine):
    # Getting make, model, variant
    name_selector = CSSSelector('[class="secondary-cell"] p a')
    name_results = name_selector(dom_tree)
    for n in name_results:
        # Python 2: backticks mean repr(), so non-breaking spaces show up as a literal '\xa0'
        mmv = str(`n.text_content()`).split('\\xa0')
        make, model, variant = mmv[0][2:], mmv[1], mmv[2][:-2]
        # Now push make, model, variant in Database
        print make, model, variant
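Looking at the list I get back, I can see that only the first page of the search results was parsed. How can I parse the entire search result?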
Answer 0: (score: 1)
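I tried to visit the site, but it appears to be offline. Still, I would like to help with the logic.
What I usually do is:
Loop from the first page to the last, making a request for each page and scraping the data you need: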
# 'last' is the number of the final results page
for page_number in range(1, last + 1):
    # Make the request, substituting page_number into the 'pg' GET variable
    url = "http://www.alotofcars.com/new_car_search.php?pg={}&byshowroomprice=0.5-500&bycity=Gotham".format(page_number)
    response_object = requests.get(url)
    dom_tree = html.fromstring(response_object.text)
    ...
    ...
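The loop above needs the last page number up front. If that number is not known, one option is to keep requesting pages until one comes back with no results. The following is only a minimal sketch of that idea in Python 3, under some assumptions: the site returns an empty result list once `pg` goes past the final page (some sites repeat the last page instead, so `max_pages` acts as a safety cap), and `fetch_items` is a hypothetical callable you supply that downloads one URL and returns the list of scraped items. `build_url` and `scrape_all_pages` are illustrative names, not part of lxml or requests.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.alotofcars.com/new_car_search.php"


def build_url(page_number):
    """Return the search URL for one results page."""
    query = {
        "pg": page_number,
        "byshowroomprice": "0.5-500",
        "bycity": "Gotham",
    }
    return BASE_URL + "?" + urlencode(query)


def scrape_all_pages(fetch_items, max_pages=50):
    """Collect items from page 1 onward until a page yields nothing.

    fetch_items is a hypothetical callable: it takes a URL and returns
    a list of scraped items (an empty list when the page has no results).
    max_pages is an assumed upper bound that prevents an endless loop.
    """
    collected = []
    for page_number in range(1, max_pages + 1):
        items = fetch_items(build_url(page_number))
        if not items:
            break  # assumed to mean we ran past the last page
        collected.extend(items)
    return collected
```

Here `fetch_items` could wrap the `requests.get` and `html.fromstring` calls from the question and return whatever the CSS selector matches. If the number of results per page is known, `last` can instead be computed as the total count (194 here) divided by the page size, rounded up.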
I hope this helps. If you have any other questions, let me know.