Question

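I just wrote a Python script that goes through lawyer profiles and pulls out their details. It works for the first page, but the loop never reaches the second page, so the script only scrapes data from the first page. I want to scrape all of the pages. Please help, I'm new to Python.

Here is the code: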

import requests

from lxml import html

root_url = 'http://lawyerlist.com.au/'

def get_page_urls(): 
  for no in ('1','2'):  
    page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)   
    tree = html.fromstring(page.text)
    return (tree.xpath('//td/a/@href'))

for li in (get_page_urls()):
  pag=requests.get(root_url + li) 
  doc = html.fromstring(pag.text)
  for name in doc.xpath('//tr/td/h1/text()'):
    print(name)

Answer 1

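The problem is the return inside the for no in ('1', '2'): loop.

As soon as the function reaches that return, it stops looping and exits, so only page 1 is ever fetched. You can collect the results of tree.xpath('//td/a/@href') in a list and return the list after the for loop has finished.

Something like this: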

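def get_page_urls():
  # Collect the links from every page before returning.
  all_trees = []
  for no in ('1','2'):
    page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)
    tree = html.fromstring(page.text)
    # extend (not append) keeps the list flat, so each item is a single href
    all_trees.extend(tree.xpath('//td/a/@href'))
  # return only after the loop has visited all pages
  return all_trees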
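
To see the effect in isolation, here is a minimal sketch that needs no network access. The PAGES dictionary and both function names are made up for illustration; they stand in for the real requests and lxml calls and are not part of the original script.

```python
# Hypothetical stand-in for the site: page number -> links found on that page.
PAGES = {
    '1': ['lawyer-a.aspx', 'lawyer-b.aspx'],
    '2': ['lawyer-c.aspx'],
}

def first_page_only():
    for no in ('1', '2'):
        # return exits the function during the first iteration,
        # so page '2' is never reached.
        return PAGES[no]

def all_pages():
    urls = []
    for no in ('1', '2'):
        # extend adds each link individually, keeping the list flat.
        urls.extend(PAGES[no])
    # Returning after the loop lets every page contribute.
    return urls

print(first_page_only())  # ['lawyer-a.aspx', 'lawyer-b.aspx']
print(all_pages())        # ['lawyer-a.aspx', 'lawyer-b.aspx', 'lawyer-c.aspx']
```

Using extend rather than append matters when the calling loop concatenates each item onto root_url: append would produce a list of lists, and adding a string to a list raises a TypeError.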

Answer 2

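Because of the return statement inside the for loop, the get_page_urls function only returns the URLs from the first page. Use a yield statement to turn the function into a generator, then iterate over each page of URLs, like this: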

import requests

from lxml import html

root_url = 'http://lawyerlist.com.au/'

def get_page_urls(): 
  for no in ('1','2'):  
    page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)   
    tree = html.fromstring(page.text)
    yield tree.xpath('//td/a/@href')

for page_of_urls in get_page_urls():
  for li in page_of_urls:
    pag=requests.get(root_url + li) 
    doc = html.fromstring(pag.text)
    for name in doc.xpath('//tr/td/h1/text()'):
      print(name)
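
If the nested loop feels awkward, one possible variation is to have the generator hand out individual links instead of one list per page. The sketch below again uses a made-up PAGES dictionary in place of the real HTTP requests, and get_all_urls is a hypothetical name, so treat it as an illustration rather than a drop-in replacement.

```python
from itertools import chain

# Hypothetical stand-in for the site: page number -> links found on that page.
PAGES = {
    '1': ['lawyer-a.aspx', 'lawyer-b.aspx'],
    '2': ['lawyer-c.aspx'],
}

def get_page_urls():
    for no in ('1', '2'):
        # Yields one list per page; the function resumes here on the next iteration.
        yield PAGES[no]

def get_all_urls():
    for no in ('1', '2'):
        # yield from hands out the links one at a time, so callers need a single loop.
        yield from PAGES[no]

# Option 1: flatten the per-page lists at the call site.
flat = list(chain.from_iterable(get_page_urls()))

# Option 2: iterate the flattening generator directly.
for li in get_all_urls():
    print(li)
```

Both options visit the same links in the same order. The generator version fetches a page only when the loop asks for it, which can help if you decide to stop early.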
