Loop does not advance to the next page

Time: 2014-08-12 16:52:04

Tags: python html lxml

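I just wrote a Python script that goes through a list of lawyer profiles and pulls their details. It works for the first page, but the loop never moves on to the second page, so the script only scrapes data from the first page. I want to scrape all the pages. Please help, I'm new to Python.

Here is the code: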

import requests

from lxml import html

root_url = 'http://lawyerlist.com.au/'

def get_page_urls(): 
  for no in ('1','2'):  
    page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)   
    tree = html.fromstring(page.text)
    return (tree.xpath('//td/a/@href'))

for li in (get_page_urls()):
  pag=requests.get(root_url + li) 
  doc = html.fromstring(pag.text)
  for name in doc.xpath('//tr/td/h1/text()'):
    print(name)

2 answers:

Answer 0 (score: 0)

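The problem is the return inside the for no in ('1', '2'): loop.

Once execution reaches that return, the loop stops and the function exits, so only page 1 is ever requested. Instead, collect the results of tree.xpath('//td/a/@href') in a list and return the list after the for loop finishes.

Something like: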

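def get_page_urls():
  all_urls = []  # links gathered from every listing page
  for no in ('1','2'):
    page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)
    tree = html.fromstring(page.text)
    # xpath() returns a list; extend keeps all_urls flat so the caller can loop over plain URLs
    all_urls.extend(tree.xpath('//td/a/@href'))
  return all_urls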
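
To see the effect of the early return in isolation, here is a minimal sketch that needs no network access. FAKE_PAGES and both function names are hypothetical stand-ins for the real listing pages and are not part of the original script.

```python
# Hypothetical stand-in for the links found on each listing page.
FAKE_PAGES = {
    '1': ['lawyer-a.aspx', 'lawyer-b.aspx'],
    '2': ['lawyer-c.aspx'],
}

def urls_with_early_return():
    for no in ('1', '2'):
        # return ends the function during the first iteration,
        # so page '2' is never looked at
        return FAKE_PAGES[no]

def urls_collected():
    all_urls = []
    for no in ('1', '2'):
        # extend adds each link individually, keeping the list flat
        all_urls.extend(FAKE_PAGES[no])
    # return only after every page has been processed
    return all_urls

print(urls_with_early_return())  # ['lawyer-a.aspx', 'lawyer-b.aspx']
print(urls_collected())          # ['lawyer-a.aspx', 'lawyer-b.aspx', 'lawyer-c.aspx']
```

Note that append would instead produce one nested list per page, which the caller would then have to loop over a second time.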

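Answer 1 (score: 0)

Because of the return statement inside the for loop, get_page_urls returns only the URLs from the first page. Use a yield statement to turn the function into a generator, then iterate over each page of URLs, like this: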

import requests

from lxml import html

root_url = 'http://lawyerlist.com.au/'

def get_page_urls(): 
  for no in ('1','2'):  
    page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)   
    tree = html.fromstring(page.text)
    yield tree.xpath('//td/a/@href')

for page_of_urls in get_page_urls():
  for li in page_of_urls:
    pag=requests.get(root_url + li) 
    doc = html.fromstring(pag.text)
    for name in doc.xpath('//tr/td/h1/text()'):
      print(name)
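
As a possible variation, yield from can hand out the links one at a time, so the caller needs only a single loop. The sketch below runs offline and also shows that a generator does its work lazily. FAKE_PAGES, fetched, and iter_lawyer_urls are hypothetical names introduced for illustration, not part of the original script.

```python
# Hypothetical stand-in for the links found on each listing page.
FAKE_PAGES = {
    '1': ['lawyer-a.aspx', 'lawyer-b.aspx'],
    '2': ['lawyer-c.aspx'],
}

fetched = []  # records which pages have been "requested" so far

def iter_lawyer_urls(page_numbers=('1', '2')):
    for no in page_numbers:
        fetched.append(no)         # stands in for requests.get(...)
        yield from FAKE_PAGES[no]  # yield each link of this page in turn

urls = iter_lawyer_urls()
first = next(urls)
print(first, fetched)  # lawyer-a.aspx ['1']

rest = list(urls)
print(rest, fetched)   # ['lawyer-b.aspx', 'lawyer-c.aspx'] ['1', '2']
```

Page 2 is only touched once the links from page 1 have been consumed, which can matter when each page is a real HTTP request.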