Web scraping across multiple URLs

Date: 2019-08-13 18:16:38

Tags: python

I'm trying to loop over several web pages in one script. However, it only pulls data from the last URL in the list.

Here is my current code:

from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package 
import requests

URLS = ['https://sc2replaystats.com/replay/playerStats/11116819/1809336', 'https://sc2replaystats.com/replay/playerStats/11116819/1809336']

for URL in URLS:
  response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')

tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('span')
    if name is not None:
        print(name['title'])

The result is:

Commandcenter
Supplydepot
Barracks
Refinery
Orbitalcommand
Commandcenter
Barracksreactor
Supplydepot
Factory
Refinery
Factorytechlab
Orbitalcommand
Starport
Bunker
Supplydepot
Supplydepot
Starporttechlab
Supplydepot
Barracks
Refinery
Supplydepot
Barracks
Engineeringbay
Refinery
Starportreactor
Factorytechlab
Supplydepot
Barracks
Supplydepot
Supplydepot
Supplydepot
Supplydepot
Supplydepot
Commandcenter
Barrackstechlab
Barracks
Barracks
Engineeringbay
Supplydepot
Barracksreactor
Barracksreactor
Supplydepot
Armory
Supplydepot
Supplydepot
Supplydepot
Orbitalcommand
Factory
Refinery
Refinery
Supplydepot
Factoryreactor
Supplydepot
Commandcenter
Barracks
Barrackstechlab
Planetaryfortress
Supplydepot
Supplydepot

Whereas I expected:

Nexus
Pylon
Gateway
Assimilator
Cyberneticscore
Pylon
Assimilator
Nexus
Roboticsfacility
Pylon
Shieldbattery
Gateway
Gateway
Commandcenter
Supplydepot
Barracks
Refinery
Orbitalcommand
Commandcenter
Barracksreactor
Supplydepot
Factory
Refinery
Factorytechlab
Orbitalcommand
Starport
Bunker
Supplydepot
Supplydepot
Starporttechlab
Supplydepot
Barracks
Refinery
Supplydepot
Barracks
Engineeringbay
Refinery
Starportreactor
Factorytechlab
Supplydepot
Barracks
Supplydepot
Supplydepot
Supplydepot
Supplydepot
Supplydepot
Commandcenter
Barrackstechlab
Barracks
Barracks
Engineeringbay
Supplydepot
Barracksreactor
Barracksreactor
Supplydepot
Armory
Supplydepot
Supplydepot
Supplydepot
Orbitalcommand
Factory
Refinery
Refinery
Supplydepot
Factoryreactor
Supplydepot
Commandcenter
Barracks
Barrackstechlab
Planetaryfortress
Supplydepot
Supplydepot

1 answer:

Answer 0: (score: 1)

Following what @RomanPerekhrest said, in the for loop,

for URL in URLS:
  response = requests.get(URL) 

you overwrite response on every iteration, so only the last page's data survives. One way to fix this is to create a list called responses and append each response to it, like this:

responses = []
for URL in URLS:
    response = requests.get(URL)
    responses.append(response)

for response in responses:
    soup = BeautifulSoup(response.content, 'html.parser')

    tb = soup.find('table', class_='table table-striped table-condensed')
    for link in tb.find_all('tr'):
        name = link.find('span')
        if name is not None:
            print(name['title'])
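As an alternative to buffering the responses in a list, the fetch and the parse can simply live in the same loop body, so each response is consumed before the next request overwrites it. A minimal sketch of that version (the helper names extract_titles and scrape_all are mine, not from the original code, and the table handling adds a guard for pages where the table is missing):

```python
from bs4 import BeautifulSoup
import requests

URLS = [
    'https://sc2replaystats.com/replay/playerStats/11116819/1809336',
    'https://sc2replaystats.com/replay/playerStats/11116819/1809336',
]

def extract_titles(html):
    """Return every span title found in the stats table, or [] if the table is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tb = soup.find('table', class_='table table-striped table-condensed')
    if tb is None:
        return []
    titles = []
    for row in tb.find_all('tr'):
        span = row.find('span')
        if span is not None:
            titles.append(span['title'])
    return titles

def scrape_all(urls):
    # Fetch and parse inside the same loop iteration, so no
    # response is discarded before its data has been printed.
    for url in urls:
        response = requests.get(url)
        for title in extract_titles(response.content):
            print(title)
```

Calling scrape_all(URLS) then prints the build order for every page in the list, not just the last one.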
