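I just wrote a Python script that visits lawyer profiles and scrapes their details. It works for the first page, but the loop never moves on to the second page, so the script only scrapes data from the first page. I want to scrape all pages. Please help, I'm new to Python.
Here is the code: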
import requests
from lxml import html

root_url = 'http://lawyerlist.com.au/'

def get_page_urls():
    for no in ('1','2'):
        page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)
        tree = html.fromstring(page.text)
        return (tree.xpath('//td/a/@href'))

for li in (get_page_urls()):
    pag=requests.get(root_url + li)
    doc = html.fromstring(pag.text)
    for name in doc.xpath('//tr/td/h1/text()'):
        print(name)
Answer 0 (score: 0)
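The problem is the return statement inside for no in ('1', '2'):
As soon as execution reaches that return, the loop stops and the function exits, so only page 1 is ever requested. Instead, you can add the results of tree.xpath('//td/a/@href') to a list and return that list after the for loop has finished.
Something like: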
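def get_page_urls():
    all_trees = []
    for no in ('1','2'):
        page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)
        tree = html.fromstring(page.text)
        # xpath() returns a list of hrefs; extend keeps all_trees a flat list
        # of strings, so the caller can still do root_url + li
        all_trees.extend(tree.xpath('//td/a/@href'))
    # return only after every page has been processed
    return all_trees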
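
To see the difference without hitting the site, here is a minimal sketch that swaps the HTTP requests for a made-up dictionary. FAKE_PAGES, urls_with_early_return and urls_collected are hypothetical names used only for illustration; they are not part of the original script.

```python
# Hypothetical stand-in for the site: page number -> links found on that page.
FAKE_PAGES = {
    '1': ['lawyer-a.aspx', 'lawyer-b.aspx'],
    '2': ['lawyer-c.aspx'],
}

def urls_with_early_return():
    for no in ('1', '2'):
        # return ends the whole function during the first iteration,
        # so page '2' is never looked at
        return FAKE_PAGES[no]

def urls_collected():
    all_urls = []
    for no in ('1', '2'):
        # gather the links from every page into one flat list
        all_urls.extend(FAKE_PAGES[no])
    # return only once the loop is done
    return all_urls

print(urls_with_early_return())
print(urls_collected())
```

The first function only ever returns the links from page '1', while the second returns the links from both pages.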
Answer 1 (score: 0)
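Because of the return statement inside the for loop, the get_page_urls
function only returns the URLs from the first page. Use a yield statement instead to turn the function into a generator, then iterate over each page of URLs, like this: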
import requests
from lxml import html

root_url = 'http://lawyerlist.com.au/'

def get_page_urls():
    for no in ('1','2'):
        page = requests.get('http://lawyerlist.com.au/lawyers.aspx?city=Sydney&Page=' + no)
        tree = html.fromstring(page.text)
        yield tree.xpath('//td/a/@href')

for page_of_urls in get_page_urls():
    for li in page_of_urls:
        pag=requests.get(root_url + li)
        doc = html.fromstring(pag.text)
        for name in doc.xpath('//tr/td/h1/text()'):
            print(name)
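
If you would rather keep a single loop on the calling side, one possible variation (an assumption on my part, not something the code above does) is to use yield from, available in Python 3.3+, so the generator hands out one URL at a time instead of one list per page. The sketch below replaces the network calls with a made-up FAKE_PAGES dictionary and adds a hypothetical page_numbers parameter so it can run on its own.

```python
# Hypothetical stand-in for the site: page number -> links found on that page.
FAKE_PAGES = {
    '1': ['lawyer-a.aspx', 'lawyer-b.aspx'],
    '2': ['lawyer-c.aspx'],
}

def get_page_urls(page_numbers=('1', '2')):
    for no in page_numbers:
        # In the real script this list would come from tree.xpath('//td/a/@href').
        urls_on_page = FAKE_PAGES[no]
        # yield from hands out each URL individually, then the loop continues
        # with the next page instead of exiting the function
        yield from urls_on_page

# The caller needs only one loop, as in the original question.
for li in get_page_urls():
    print(li)
```

Either form works; yielding whole lists keeps the page boundaries visible, while yield from gives the caller a flat stream of URLs.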