Links from the original web page are missing after parsing with Beautiful Soup

Time: 2019-03-24 14:53:32

Tags: python web-scraping beautifulsoup

Apologies if my explanation is basic. I am new to Python and Beautiful Soup.

I am trying to extract data from the following website:

https://valor.militarytimes.com/award/5?page=1

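I want to extract the links that correspond to the 24 medal recipients listed on the site. The Firefox inspector shows that their links all contain the word "hero". However, when I parse the site with Beautiful Soup, these links do not appear.

I have tried both the standard HTML parser and the html5lib parser, but neither one shows the links for these medal recipients.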

import requests
from bs4 import BeautifulSoup

page = requests.get('https://valor.militarytimes.com/award/5?page=1')
soup = BeautifulSoup(page.text, "html5lib")
for link in soup.find_all('a', href=True):
    print(link)

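The code above finds only some of the links on the original site and, in particular, none of the links for the medal recipients. Even soup.prettify() shows that these links are not in the parsed text.

I am hoping for a simple piece of code that extracts the links of the 24 medal recipients on this site.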

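3 answers:

Answer 0 (score: 1)

If you want to avoid Selenium, there is a simple way to get the data you need. The page loads its data by sending a POST request to the following URL: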

https://valor.militarytimes.com/api/awards/5?page=1

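This returns a JSON response, which JavaScript then uses to populate the page. All you have to do is send the same request with python-requests and read the data from the JSON response.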

import requests

# Send the same POST request the page itself makes
r = requests.post('https://valor.militarytimes.com/api/awards/5?page=1')
for item in r.json()['data']:
    name = item['recipient']['name']
    # Each recipient's page lives at /hero/<id>
    url = 'https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
    print(name, url)

Output:

EUGENE MCCARLEY https://valor.militarytimes.com/hero/500963
TIMOTHY KEENAN https://valor.militarytimes.com/hero/500962
JOHN THOMPSON https://valor.militarytimes.com/hero/500961
WALTER BORDEN https://valor.militarytimes.com/hero/500941
WILLIAM ROSE https://valor.militarytimes.com/hero/94465
YUKITAKA MIZUTARI https://valor.militarytimes.com/hero/94175
ALBERT MARTIN https://valor.militarytimes.com/hero/92498
FRANCIS CODY https://valor.militarytimes.com/hero/500944
JAMES O'KEEFFE https://valor.militarytimes.com/hero/500943
PHILLIP FLEMING https://valor.militarytimes.com/hero/500942
JOHN WANAMAKER https://valor.militarytimes.com/hero/314466
ROBERT CHILSON https://valor.militarytimes.com/hero/102316
CHRISTOPHER NELMS https://valor.militarytimes.com/hero/89255
SAMUEL BARNETT https://valor.militarytimes.com/hero/71533
ANDREW BYERS https://valor.militarytimes.com/hero/500938
ANDREW RUSSELL https://valor.militarytimes.com/hero/500937
****** CALDWELL https://valor.militarytimes.com/hero/500935
****** WALWRATH https://valor.militarytimes.com/hero/500934
****** MADSEN https://valor.militarytimes.com/hero/500933
****** NELSON https://valor.militarytimes.com/hero/500932
WILLIAM SOUKUP https://valor.militarytimes.com/hero/500931
BENJAMIN WILSON https://valor.militarytimes.com/hero/500930
ANDREW MARCKESANO https://valor.militarytimes.com/hero/500929
WAYNE KUNZ https://valor.militarytimes.com/hero/500927

I printed the names as well. If you only need the links, you can drop the name.
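As a minimal sketch of the links-only variant, the URL-building step can be pulled into a small function that works on the already-parsed JSON. The function name `hero_links` and the hand-made `sample` dictionary are illustrative assumptions; only the `data` / `recipient` / `id` keys come from the code above.

```python
BASE_URL = 'https://valor.militarytimes.com/hero/'


def hero_links(payload):
    """Return one hero URL per entry in payload['data'] (hypothetical helper)."""
    return [BASE_URL + str(item['recipient']['id']) for item in payload['data']]


# Hand-made sample shaped like the API response used above (not real output)
sample = {
    'data': [
        {'recipient': {'id': 500963, 'name': 'EUGENE MCCARLEY'}},
        {'recipient': {'id': 500962, 'name': 'TIMOTHY KEENAN'}},
    ]
}
print(hero_links(sample))
```

With a live response you would presumably call it as `hero_links(r.json())`.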

Edit

To get the URLs from multiple pages, use this code:

import requests
list_of_urls=[]
last_page=9 #replace this with your last page
for i in range(1,last_page+1):
    r=requests.post('https://valor.militarytimes.com/api/awards/5?page={}'.format(i))
    for item in r.json()['data']:
        url='https://valor.militarytimes.com/hero/'+str(item['recipient']['id'])
        list_of_urls.append(url)
print(list_of_urls)

Output:

['https://valor.militarytimes.com/hero/500963', 'https://valor.militarytimes.com/hero/500962', 'https://valor.militarytimes.com/hero/500961', 'https://valor.militarytimes.com/hero/500941', 'https://valor.militarytimes.com/hero/94465', 'https://valor.militarytimes.com/hero/94175', 'https://valor.militarytimes.com/hero/92498', 'https://valor.militarytimes.com/hero/500944', 'https://valor.militarytimes.com/hero/500943', 'https://valor.militarytimes.com/hero/500942', 'https://valor.militarytimes.com/hero/314466', 'https://valor.militarytimes.com/hero/102316', 'https://valor.militarytimes.com/hero/89255', 'https://valor.militarytimes.com/hero/71533', 'https://valor.militarytimes.com/hero/500938', 'https://valor.militarytimes.com/hero/500937', 'https://valor.militarytimes.com/hero/500935', 'https://valor.militarytimes.com/hero/500934', 'https://valor.militarytimes.com/hero/500933', 'https://valor.militarytimes.com/hero/500932', 'https://valor.militarytimes.com/hero/500931', 'https://valor.militarytimes.com/hero/500930', 'https://valor.militarytimes.com/hero/500929', 'https://valor.militarytimes.com/hero/500927', 'https://valor.militarytimes.com/hero/500926', 'https://valor.militarytimes.com/hero/500925', 'https://valor.militarytimes.com/hero/500924', 'https://valor.militarytimes.com/hero/500923', 'https://valor.militarytimes.com/hero/500922', 'https://valor.militarytimes.com/hero/500921', 'https://valor.militarytimes.com/hero/500920', 'https://valor.militarytimes.com/hero/500919', 'https://valor.militarytimes.com/hero/500918', 'https://valor.militarytimes.com/hero/500917', 'https://valor.militarytimes.com/hero/500916', 'https://valor.militarytimes.com/hero/500915', 'https://valor.militarytimes.com/hero/500914', 'https://valor.militarytimes.com/hero/500913', 'https://valor.militarytimes.com/hero/500912', 'https://valor.militarytimes.com/hero/500911', 'https://valor.militarytimes.com/hero/500910', 'https://valor.militarytimes.com/hero/500909', 
'https://valor.militarytimes.com/hero/500908', 'https://valor.militarytimes.com/hero/500907', 'https://valor.militarytimes.com/hero/500906', 'https://valor.militarytimes.com/hero/500905', 'https://valor.militarytimes.com/hero/500904', 'https://valor.militarytimes.com/hero/500903', 'https://valor.militarytimes.com/hero/500902', 'https://valor.militarytimes.com/hero/500901', 'https://valor.militarytimes.com/hero/500900', 'https://valor.militarytimes.com/hero/500899', 'https://valor.militarytimes.com/hero/500898', 'https://valor.militarytimes.com/hero/500897', 'https://valor.militarytimes.com/hero/500896', 'https://valor.militarytimes.com/hero/500895', 'https://valor.militarytimes.com/hero/500894', 'https://valor.militarytimes.com/hero/500893', 'https://valor.militarytimes.com/hero/500892', 'https://valor.militarytimes.com/hero/500891', 'https://valor.militarytimes.com/hero/500890', 'https://valor.militarytimes.com/hero/500889', 'https://valor.militarytimes.com/hero/500888', 'https://valor.militarytimes.com/hero/29160', 'https://valor.militarytimes.com/hero/106931', 'https://valor.militarytimes.com/hero/106375', 'https://valor.militarytimes.com/hero/94936', 'https://valor.militarytimes.com/hero/94928', 'https://valor.militarytimes.com/hero/94927', 'https://valor.militarytimes.com/hero/94926', 'https://valor.militarytimes.com/hero/94923', 'https://valor.militarytimes.com/hero/94777', 'https://valor.militarytimes.com/hero/94769', 'https://valor.militarytimes.com/hero/94711', 'https://valor.militarytimes.com/hero/94644', 'https://valor.militarytimes.com/hero/94571', 'https://valor.militarytimes.com/hero/94570', 'https://valor.militarytimes.com/hero/94494', 'https://valor.militarytimes.com/hero/94468', 'https://valor.militarytimes.com/hero/94454', 'https://valor.militarytimes.com/hero/94388', 'https://valor.militarytimes.com/hero/94358', 'https://valor.militarytimes.com/hero/94279', 'https://valor.militarytimes.com/hero/94275', 
'https://valor.militarytimes.com/hero/94253', 'https://valor.militarytimes.com/hero/94251', 'https://valor.militarytimes.com/hero/94223', 'https://valor.militarytimes.com/hero/94222', 'https://valor.militarytimes.com/hero/94217', 'https://valor.militarytimes.com/hero/94211', 'https://valor.militarytimes.com/hero/94210', 'https://valor.militarytimes.com/hero/94195', 'https://valor.militarytimes.com/hero/94194', 'https://valor.militarytimes.com/hero/94173', 'https://valor.militarytimes.com/hero/94168', 'https://valor.militarytimes.com/hero/94055', 'https://valor.militarytimes.com/hero/93916', 'https://valor.militarytimes.com/hero/93847', 'https://valor.militarytimes.com/hero/93780', 'https://valor.militarytimes.com/hero/93779', 'https://valor.militarytimes.com/hero/93775', 'https://valor.militarytimes.com/hero/93774', 'https://valor.militarytimes.com/hero/93733', 'https://valor.militarytimes.com/hero/93722', 'https://valor.militarytimes.com/hero/93706', 'https://valor.militarytimes.com/hero/93551', 'https://valor.militarytimes.com/hero/93435', 'https://valor.militarytimes.com/hero/93407', 'https://valor.militarytimes.com/hero/93374', 'https://valor.militarytimes.com/hero/93277', 'https://valor.militarytimes.com/hero/93243', 'https://valor.militarytimes.com/hero/93193', 'https://valor.militarytimes.com/hero/92989', 'https://valor.militarytimes.com/hero/92972', 'https://valor.militarytimes.com/hero/92958', 'https://valor.militarytimes.com/hero/93923', 'https://valor.militarytimes.com/hero/90130', 'https://valor.militarytimes.com/hero/90128', 'https://valor.militarytimes.com/hero/89704', 'https://valor.militarytimes.com/hero/89703', 'https://valor.militarytimes.com/hero/89702', 'https://valor.militarytimes.com/hero/89701', 'https://valor.militarytimes.com/hero/89698', 'https://valor.militarytimes.com/hero/89673', 'https://valor.militarytimes.com/hero/89661', 'https://valor.militarytimes.com/hero/90127', 'https://valor.militarytimes.com/hero/89535', 
'https://valor.militarytimes.com/hero/89493', 'https://valor.militarytimes.com/hero/89406', 'https://valor.militarytimes.com/hero/89405', 'https://valor.militarytimes.com/hero/89404', 'https://valor.militarytimes.com/hero/89261', 'https://valor.militarytimes.com/hero/89259', 'https://valor.militarytimes.com/hero/88805', 'https://valor.militarytimes.com/hero/88803', 'https://valor.militarytimes.com/hero/88789', 'https://valor.militarytimes.com/hero/88770', 'https://valor.militarytimes.com/hero/88766', 'https://valor.militarytimes.com/hero/88765', 'https://valor.militarytimes.com/hero/88719', 'https://valor.militarytimes.com/hero/88680', 'https://valor.militarytimes.com/hero/88679', 'https://valor.militarytimes.com/hero/88678', 'https://valor.militarytimes.com/hero/88658', 'https://valor.militarytimes.com/hero/88657', 'https://valor.militarytimes.com/hero/88616', 'https://valor.militarytimes.com/hero/88578', 'https://valor.militarytimes.com/hero/88551', 'https://valor.militarytimes.com/hero/88445', 'https://valor.militarytimes.com/hero/88366', 'https://valor.militarytimes.com/hero/88365', 'https://valor.militarytimes.com/hero/88045', 'https://valor.militarytimes.com/hero/88044', 'https://valor.militarytimes.com/hero/88013', 'https://valor.militarytimes.com/hero/88012', 'https://valor.militarytimes.com/hero/87986', 'https://valor.militarytimes.com/hero/87918', 'https://valor.militarytimes.com/hero/87909', 'https://valor.militarytimes.com/hero/87898', 'https://valor.militarytimes.com/hero/87830', 'https://valor.militarytimes.com/hero/88570', 'https://valor.militarytimes.com/hero/88568', 'https://valor.militarytimes.com/hero/88239', 'https://valor.militarytimes.com/hero/87792', 'https://valor.militarytimes.com/hero/87782', 'https://valor.militarytimes.com/hero/87677', 'https://valor.militarytimes.com/hero/87655', 'https://valor.militarytimes.com/hero/87523', 'https://valor.militarytimes.com/hero/87460', 'https://valor.militarytimes.com/hero/87292', 
'https://valor.militarytimes.com/hero/87291', 'https://valor.militarytimes.com/hero/87288', 'https://valor.militarytimes.com/hero/87283', 'https://valor.militarytimes.com/hero/87282', 'https://valor.militarytimes.com/hero/87281', 'https://valor.militarytimes.com/hero/87280', 'https://valor.militarytimes.com/hero/87279', 'https://valor.militarytimes.com/hero/87272', 'https://valor.militarytimes.com/hero/86875', 'https://valor.militarytimes.com/hero/86811', 'https://valor.militarytimes.com/hero/86451', 'https://valor.militarytimes.com/hero/86077', 'https://valor.militarytimes.com/hero/86076', 'https://valor.militarytimes.com/hero/85994', 'https://valor.militarytimes.com/hero/86005', 'https://valor.militarytimes.com/hero/6190', 'https://valor.militarytimes.com/hero/5022', 'https://valor.militarytimes.com/hero/500877', 'https://valor.militarytimes.com/hero/500851', 'https://valor.militarytimes.com/hero/500844', 'https://valor.militarytimes.com/hero/500843', 'https://valor.militarytimes.com/hero/500842', 'https://valor.militarytimes.com/hero/500841', 'https://valor.militarytimes.com/hero/500840', 'https://valor.militarytimes.com/hero/500839', 'https://valor.militarytimes.com/hero/500838', 'https://valor.militarytimes.com/hero/500837', 'https://valor.militarytimes.com/hero/500836', 'https://valor.militarytimes.com/hero/500835', 'https://valor.militarytimes.com/hero/500834', 'https://valor.militarytimes.com/hero/500833', 'https://valor.militarytimes.com/hero/500832', 'https://valor.militarytimes.com/hero/500831', 'https://valor.militarytimes.com/hero/500830', 'https://valor.militarytimes.com/hero/500829', 'https://valor.militarytimes.com/hero/500827', 'https://valor.militarytimes.com/hero/500826', 'https://valor.militarytimes.com/hero/500817', 'https://valor.militarytimes.com/hero/500816', 'https://valor.militarytimes.com/hero/500815', 'https://valor.militarytimes.com/hero/500813', 'https://valor.militarytimes.com/hero/500808', 
'https://valor.militarytimes.com/hero/401188', 'https://valor.militarytimes.com/hero/401185', 'https://valor.militarytimes.com/hero/89851', 'https://valor.militarytimes.com/hero/89846']

Answer 1 (score: 0)

You can use Selenium WebDriver together with Beautiful Soup:

from selenium import webdriver
import time
from bs4 import BeautifulSoup

url = 'https://valor.militarytimes.com/award/5?page=1'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920,1080')
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
time.sleep(10)  # crude wait for the JavaScript to populate the page
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'lxml')
items = soup.select('a[href]')  # only anchors that have an href attribute
hero = []
for item in items:
    if 'hero' in item['href']:
        print(item['href'])
        hero.append(item['href'])

print(hero)

Output:

/hero/500963
/hero/500962
/hero/500961
/hero/500941
/hero/94465
/hero/94175
/hero/92498
/hero/500944
/hero/500943
/hero/500942
/hero/314466
/hero/102316
/hero/89255
/hero/71533
/hero/500938
/hero/500937
/hero/500935
/hero/500934
/hero/500933
/hero/500932
/hero/500931
/hero/500930
/hero/500929
/hero/500927


['/hero/500963', '/hero/500962', '/hero/500961', '/hero/500941', '/hero/94465', '/hero/94175', '/hero/92498', '/hero/500944', '/hero/500943', '/hero/500942', '/hero/314466', '/hero/102316', '/hero/89255', '/hero/71533', '/hero/500938', '/hero/500937', '/hero/500935', '/hero/500934', '/hero/500933', '/hero/500932', '/hero/500931', '/hero/500930', '/hero/500929', '/hero/500927']
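The hrefs collected this way are relative paths. If absolute URLs are needed, one option (a sketch, not part of the original answer) is `urllib.parse.urljoin` from the standard library, resolved against the page URL. The two sample paths are taken from the output above.

```python
from urllib.parse import urljoin

page_url = 'https://valor.militarytimes.com/award/5?page=1'
relative = ['/hero/500963', '/hero/500962']

# urljoin resolves each relative href against the page it was found on
absolute = [urljoin(page_url, href) for href in relative]
print(absolute)
```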

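Answer 2 (score: 0)

You can make a POST request to the API to retrieve JSON containing each recipient's ID, which you can concatenate onto a base URL to give each recipient's full URL. The JSON also contains the URL of the last page, so you can determine the end point for a subsequent loop over all pages.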


While testing against all of the results, I found that you may need to back off and retry.
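One possible way to back off and retry is an exponential delay around the request. This is only a sketch under assumptions: the helper name `with_backoff`, the number of attempts and the delays are all made up for illustration, and the answer does not say which errors the API actually returns.

```python
import time


def with_backoff(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Hypothetical helper: re-raises the last error once attempts run out.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

With requests, usage might look like `with_backoff(lambda: requests.post(url, headers=headers).json())`, where `url` and `headers` are whatever you pass to the API. The `sleep` parameter exists only so the delay can be replaced when testing.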

You can build the other URLs as follows:

import requests

baseUrl = 'https://valor.militarytimes.com/hero/'
url = 'https://valor.militarytimes.com/api/awards/5?page=1'
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Referer': 'https://valor.militarytimes.com/award/5?page=1',
    'User-Agent': 'Mozilla/5.0'
}

info = requests.post(url, headers=headers, data='').json()
urls = [baseUrl + str(item['recipient']['id']) for item in info['data']]  # page 1
linksInfo = info['links']
firstLink = linksInfo['first']
lastLink = linksInfo['last']
# Strip everything but the page number from the 'last' link
lastPage = lastLink.replace('https://valor.militarytimes.com/api/awards/5?page=', '')
print('last page = ' + lastPage)
print(urls)
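To continue from here, the last page number can be turned into the list of API URLs to loop over. The sketch below reads the page number with `urllib.parse` instead of a string replace. The function name `page_urls` is hypothetical, and the example link with `page=3` is made up rather than taken from a real response.

```python
from urllib.parse import urlparse, parse_qs


def page_urls(last_link):
    """Build the API URL of every page from the 'last' link (hypothetical helper)."""
    query = parse_qs(urlparse(last_link).query)
    last_page = int(query['page'][0])
    base = last_link.split('?')[0]
    return [f'{base}?page={n}' for n in range(1, last_page + 1)]


# Made-up 'last' link value for illustration
print(page_urls('https://valor.militarytimes.com/api/awards/5?page=3'))
```

Each URL in the returned list could then be requested in turn and its `data` entries appended to `urls`.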