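I'm building a web scraper and am trying to request multiple URLs that share the same URL path apart from a numeric ID.
My code for scraping a single URL is as follows: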
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
The URLs have the same structure apart from the company number. I tried the following code to scrape multiple pages, but without success:
import requests
from bs4 import BeautifulSoup as bs
pages = []
for i in range(11003058, 11003059, 00930291):
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
This only gives me the first page (/11003058/officers). Why doesn't it loop through all of them? Can anyone help?
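Answer 0 (score: 1)
This should solve your problem:
The range() function returns a sequence of numbers that starts at 0 by default, increments by 1 by default, and stops before the specified number (the stop value itself is excluded).
Syntax: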
range(start, stop, step)
https://docs.python.org/3/library/functions.html#func-range
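To make the behaviour concrete, here is a minimal sketch of what the call in the question produces. The step is written as 930291 because Python 3 does not accept an integer literal with leading zeros; the variable names are only illustrative.

```python
# The asker's call: start=11003058, stop=11003059, step=930291.
# stop is excluded, so only the start value fits in the range.
asker_values = list(range(11003058, 11003059, 930291))
print(asker_values)  # [11003058]

# Covering two consecutive IDs needs stop = last ID + 1.
consecutive = list(range(11003058, 11003060))
print(consecutive)  # [11003058, 11003059]
```

Since the three company numbers in the question are not evenly spaced, range() is not a good fit here, and listing the IDs explicitly is simpler.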
Replace your code with:
# Keep the company numbers as strings so the leading zeros are preserved
company_id = ["11003058", "11003059", "00930291"]
pages = []
for i in company_id:
    url = 'https://beta.companieshouse.gov.uk/company/' + i + '/officers'
    pages.append(url)
You should initialize soup as a list before iterating over the pages:
soup = []
and append each parsed page to the soup list:
for item in pages:
    page = requests.get(item)
    soup.append(bs(page.text, 'lxml'))
Print the list of names:
names = []
for items in soup:
    h2Obj = items.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')
    for i in h2Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            # Keep only the text of the <a> tags inside each matching <h2>
            if isinstance(tag, Tag) and tag.name == 'a':
                names.append(tag.text)
print(names)
Output:
['MASRAT, Suheel', 'MARSHALL, Jack', 'SUTTON, Tim', 'COOMBES, John Frederick', 'BROWN, Alistair Stuart', 'COOMBES, Kenneth', 'LAFONT, Jean-Jacques Mathieu', 'THOMAS-KEEPING, Lindsay Charles', 'WILLIAMS, Janet Elizabeth', 'WILLIAMS, Roderick', 'WRAGG, Barry']
Add this at the top of the script:
from bs4.element import Tag
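The names list in this answer is flat, so it no longer shows which company each director belongs to. One possible variation, sketched below, keeps the results keyed by company number. The names BASE_URL, collect_by_company and fake_fetch_names are hypothetical and not part of the original answer; fake_fetch_names is a stand-in that makes no network request, so the sketch runs offline.

```python
# Hypothetical URL template built from the URL used in the question.
BASE_URL = 'https://beta.companieshouse.gov.uk/company/{}/officers'


def collect_by_company(company_ids, fetch_names):
    """Map each company number to whatever fetch_names returns for its URL."""
    results = {}
    for company_id in company_ids:
        url = BASE_URL.format(company_id)
        results[company_id] = fetch_names(url)
    return results


def fake_fetch_names(url):
    # Stand-in for the real scraping step; it only echoes the URL.
    return ['fetched from ' + url]


demo = collect_by_company(['11003058', '00930291'], fake_fetch_names)
for company_id, names in demo.items():
    print(company_id, names)
```

In real use, fetch_names would be a function that calls requests.get on the URL and applies the CSS selector from the question to the parsed page.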
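Answer 1 (score: 0)
The syntax of range is range(start, stop, step). It loops from start to stop - 1, incrementing by step each time. What you are doing here is odd, because in your case stop equals start + 1, so it loops only once, with the value start.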
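I think you just want to fetch these 3 URLs, so iterate over them directly (as strings, because Python 3 rejects the integer literal 00930291 and an integer would drop the leading zeros anyway):
for i in ('11003058', '11003059', '00930291'):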
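If the company numbers do arrive as integers, the leading zeros have to be restored before building the URL. A minimal sketch, assuming the numbers are zero-padded to 8 characters as the three IDs in the question suggest; format_company_number is a hypothetical helper name.

```python
def format_company_number(number, width=8):
    """Return the company number as a zero-padded string.

    The width of 8 is an assumption based on the IDs in the question.
    """
    return str(number).zfill(width)


padded = format_company_number(930291)
unchanged = format_company_number('11003058')
print(padded)     # 00930291
print(unchanged)  # 11003058
```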
Answer 2 (score: 0)
Looping over a range: the iteration always includes start_value and excludes end_value.
Try this:
import requests
from bs4 import BeautifulSoup as bs

# Keep the company numbers as strings so the leading zeros are preserved
company_ids = ['11003058', '11003059', '00930291']
pages = []
i = 0
while i < len(company_ids):
    url = 'https://beta.companieshouse.gov.uk/company/' + company_ids[i] + '/officers'
    pages.append(url)
    i += 1  # advance the index, otherwise the loop never ends
for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')
    # Select and print the director names inside the loop, once per page
    names = [tag.text.strip() for tag in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
    print(names)