Unable to iterate over href values in web scraping

Time: 2020-01-17 08:07:51

Tags: python web-scraping beautifulsoup

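I am very new to web scraping. I need to scrape the anchor-tag links from a specific section of a web page, but I am missing something that I can't find. My code prints only one link.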

import requests
from bs4 import BeautifulSoup

page  = requests.get('https://www.privacy.gov.ph/memorandum-circulars/')
prefix = 'https://www.privacy.gov.ph' 
soup = BeautifulSoup(page.content,'html.parser')

container = soup.findAll("section", {"class": "news_content"})
for circulars in container:

     pdf =  prefix + circulars.div.a['href'].replace("..", "")
     print(pdf)

3 answers:

Answer 0: (score: 1)

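That is because you are looping over the sections, but there is only one section, so the loop body runs once and circulars.div.a returns only the first anchor in it.

You can call find_all on the section instead:

Code: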

import requests
from bs4 import BeautifulSoup

page  = requests.get('https://www.privacy.gov.ph/memorandum-circulars/')
prefix = 'https://www.privacy.gov.ph' 
soup = BeautifulSoup(page.content,'html.parser')

section = soup.find("section", {"class": "news_content"})
for link in section.find_all("a"):
     pdf =  prefix + link['href'].replace(prefix,"").replace("..", "")
     print(pdf)
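
To see why the original loop printed a single link, here is a minimal sketch on a made-up HTML snippet. The markup and hrefs are hypothetical stand-ins, not the real page; only the section class name is taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one section containing several anchors.
html = """
<section class="news_content">
  <div>
    <a href="../a/">A</a>
    <a href="/b/">B</a>
  </div>
  <div>
    <a href="https://example.com/c.pdf">C</a>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
sections = soup.find_all("section", {"class": "news_content"})

# Attribute-style navigation (.div.a) stops at the first match,
# so one section yields exactly one href.
first_only = [s.div.a["href"] for s in sections]

# find_all walks every descendant anchor of each section.
all_links = [a["href"] for s in sections for a in s.find_all("a", href=True)]

print(first_only)
print(all_links)
```

The number of links printed by the question's code equals the number of matching sections, not the number of anchors inside them.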

Answer 1: (score: 1)

Try the code below:

import requests
from bs4 import BeautifulSoup

page  = requests.get('https://www.privacy.gov.ph/memorandum-circulars/')
prefix = 'https://www.privacy.gov.ph' 
soup = BeautifulSoup(page.content,'html.parser')

container = soup.findAll("section", {"class": "news_content"})
for circulars in container:
    for a in circulars.findAll('a', href=True):
         pdf =  prefix + a['href'].replace("..", "") if prefix not in a['href'] else a['href']
         print(pdf)

Output:

https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-01-security-of-personal-data-in-government-agencies/
https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-02-data-sharing-agreements-involving-government-agencies/
https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-03-personal-data-breach-management/
https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-04-rules-of-procedure/
https://www.privacy.gov.ph/npc-circular-17-01-registration-data-processing-notifications-regarding-automated-decision-making/
https://www.privacy.gov.ph/wp-content/uploads/2017/08/NPC17-01_Appendix-1.pdf
https://www.privacy.gov.ph/npc-circular-no-18-01-rules-of-procedure-on-requests-for-advisory-opinions/
https://www.privacy.gov.ph/npc-circular-no-18-02-guidelines-on-compliance-checks/
https://www.privacy.gov.ph/npc-circular-no-18-03-rules-on-mediation-before-the-national-privacy-commission
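
As a possible alternative to stripping ".." and the prefix by hand, the standard library's urllib.parse.urljoin can resolve each href against the page URL. This is only a sketch: the helper name absolutize and the sample hrefs are made up for illustration, and I am assuming the page mixes relative and absolute links.

```python
from urllib.parse import urljoin

# Page URL from the question; hrefs are resolved relative to it.
BASE_URL = "https://www.privacy.gov.ph/memorandum-circulars/"


def absolutize(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


# Illustrative hrefs, not copied from the real page.
sample_hrefs = [
    "../some-circular/",                       # parent-relative
    "/wp-content/uploads/sample.pdf",          # root-relative
    "https://www.privacy.gov.ph/already-full/",  # already absolute
    "child-page/",                             # relative to current directory
]

resolved = absolutize(BASE_URL, sample_hrefs)
for url in resolved:
    print(url)
```

With BeautifulSoup you would pass a['href'] for each anchor instead of the sample list. Absolute hrefs pass through unchanged, so no "prefix not in href" check is needed.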

Answer 2: (score: 0)

Try a solution using SimplifiedDoc; it can help you convert relative paths to full paths.

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
url = 'https://www.privacy.gov.ph/memorandum-circulars/'
page  = requests.get(url)
doc = SimplifiedDoc(page.text)
container = doc.getElement('section',attr='class',value='news_content')
lstA = container.listA(url=url)
print ([a.url for a in lstA])

Result:

['https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-01-security-of-personal-data-in-government-agencies/', 'https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-02-data-sharing-agreements-involving-government-agencies/', 'https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-03-personal-data-breach-management/', 'https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-04-rules-of-procedure/', 'https://www.privacy.gov.ph/npc-circular-17-01-registration-data-processing-notifications-regarding-automated-decision-making/', 'https://www.privacy.gov.ph/wp-content/uploads/2017/08/npc17-01_appendix-1.pdf', 'https://www.privacy.gov.ph/npc-circular-no-18-01-rules-of-procedure-on-requests-for-advisory-opinions/', 'https://www.privacy.gov.ph/npc-circular-no-18-02-guidelines-on-compliance-checks/', 'https://www.privacy.gov.ph/npc-circular-no-18-03-rules-on-mediation-before-the-national-privacy-commission']

You can find examples of SimplifiedDoc here