Requesting multiple URLs in a Python scraping script

Date: 2019-05-13 10:39:41

Tags: python python-3.x loops url request

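I am building a web scraper and am trying to request multiple URLs that share the same URL path except for a numeric ID.

My code for scraping a single URL is as follows: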

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)

The URLs have the same structure apart from the company number. I tried the following code to scrape multiple pages, but it did not work:

import requests
from bs4 import BeautifulSoup as bs

pages = []

for i in range(11003058, 11003059, 00930291):
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')

names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]

print(names)

This only gives me the first page (/11003058/officers). Why doesn't it loop through all of them? Can anyone help?

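3 Answers:

Answer 0: (score: 1)

This should solve your problem:

The range() function returns a sequence of numbers. By default it starts at 0 and increments by 1, and it stops before the specified stop value (the stop value itself is not included).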

Syntax:

 range(start, stop, step)

https://docs.python.org/3/library/functions.html#func-range
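
To see why the loop in the question runs only once, here is a small sketch of what that range() call yields. The step is written as 930291 because a decimal literal with a leading zero (00930291) is not accepted by Python 3; that substitution is an assumption about what was intended.

```python
# range(start, stop, step) stops *before* stop, so with stop == start + 1
# only the start value is ever produced, whatever the step is.
ids_from_question = list(range(11003058, 11003059, 930291))
print(ids_from_question)  # [11003058]

# A range can only walk evenly spaced numbers. Two consecutive IDs work,
# but three unrelated company numbers need an explicit list instead.
consecutive = list(range(11003058, 11003060))
print(consecutive)  # [11003058, 11003059]
```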

Replace your code with:

company_id = ["11003058","11003059","00930291"]

for i in company_id:
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)

You should initialize soup as a list before iterating over the pages:

soup = []

And append each parsed page to the soup list:

for item in pages:
  page = requests.get(item)
  soup.append(bs(page.text, 'lxml'))

Build and print the list of names:

names = []
for items in soup:
    # every matching <h2> holds one director entry
    h2Obj = items.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')
    for i in h2Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            # compare with ==; "in 'a'" is a substring test, not an equality test
            if isinstance(tag, Tag) and tag.name == 'a':
                names.append(tag.text)

print(names)

Output:

['MASRAT, Suheel', 'MARSHALL, Jack', 'SUTTON, Tim', 'COOMBES, John Frederick', 'BROWN, Alistair Stuart', 'COOMBES, Kenneth', 'LAFONT, Jean-Jacques Mathieu', 'THOMAS-KEEPING, Lindsay Charles', 'WILLIAMS, Janet Elizabeth', 'WILLIAMS, Roderick', 'WRAGG, Barry']

Add this at the top of the script:

from bs4.element import Tag
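
The reason for collecting results in a list, rather than reusing one variable, can be shown without any network access. This is only a sketch: fake_parse and the FAKE_PAGES data are hypothetical stand-ins for requests.get plus the BeautifulSoup selection, and the names are made up.

```python
# Hypothetical stand-in for "fetch a page and select the names on it".
# Keys play the role of URLs; the values are invented for illustration.
FAKE_PAGES = {
    'page-1': ['NAME, One'],
    'page-2': ['NAME, Two', 'NAME, Three'],
}


def fake_parse(url):
    """Return the names 'found' on the given fake page."""
    return FAKE_PAGES[url]


# Overwriting: the variable is rebound on every pass, so after the loop
# it only holds the result for the last page.
for url in FAKE_PAGES:
    result = fake_parse(url)
overwritten = result

# Accumulating: extend one list inside the loop so every page contributes.
accumulated = []
for url in FAKE_PAGES:
    accumulated.extend(fake_parse(url))

print(overwritten)
print(accumulated)
```

The same pattern applies to the real scraper: whatever is computed from each page has to be stored inside the loop, not after it.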

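Answer 1: (score: 0)

The syntax of range is range(start, stop, step). It loops from start to stop - 1, incrementing by step each time. What you are doing here is odd: in your case stop equals start + 1, so the loop runs only once, with the start value.

I guess you just want to fetch these 3 URLs: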

for i in ('11003058', '11003059', '00930291'):
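
One detail worth spelling out: a company number with a leading zero cannot be written as a plain integer literal in Python 3, and converting it to an int would drop the zero anyway. A minimal sketch of building the URLs from strings follows; the zfill(8) step assumes company numbers are always 8 characters long, which matches the three IDs in the question but is otherwise an assumption.

```python
BASE = 'https://beta.companieshouse.gov.uk/company/'

# Keep the IDs as strings so the leading zero in '00930291' survives.
company_ids = ['11003058', '11003059', '00930291']
urls = [BASE + cid + '/officers' for cid in company_ids]

# If an ID arrives as an integer, zfill pads it back to 8 digits
# (assumed width, see the note above).
padded = str(930291).zfill(8)

print(urls[2])
print(padded)
```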

Answer 2: (score: 0)

Looping over a range: the loop always includes start_value and excludes end_value.

Try this:

import requests
from bs4 import BeautifulSoup as bs

company_ids = ['11003058', '11003059', '00930291']
pages = []
i = 0
# iterate over the IDs, not over the list being appended to
while i < len(company_ids):
    url = 'https://beta.companieshouse.gov.uk/company/' + company_ids[i] + '/officers'
    pages.append(url)
    i += 1

names = []
for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')
    # collect names inside the loop so every page contributes, not just the last one
    names.extend(tag.text.strip() for tag in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2'))

print(names)