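I'm very new to web scraping. I need to scrape the anchor-tag links from a specific section of a web page, but I'm missing something that I can't figure out: my code prints only one link.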
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.privacy.gov.ph/memorandum-circulars/')
prefix = 'https://www.privacy.gov.ph'
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.findAll("section", {"class": "news_content"})
for circulars in container:
    pdf = prefix + circulars.div.a['href'].replace("..", "")
    print(pdf)
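Answer 0 (score: 1)
This is because you call find_all on the section, but there is only one section, so the loop runs once, and circulars.div.a returns only the first anchor inside it.
You can do it this way instead: find the single section, then loop over every anchor it contains.
Code: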
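import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.privacy.gov.ph/memorandum-circulars/')
prefix = 'https://www.privacy.gov.ph'
soup = BeautifulSoup(page.content, 'html.parser')
# There is only one matching section, so find() is enough.
section = soup.find("section", {"class": "news_content"})
# Iterate over every anchor in the section, not just the first one.
for link in section.find_all("a"):
    # Strip the prefix if the href is already absolute, then add it back.
    pdf = prefix + link['href'].replace(prefix, "").replace("..", "")
    print(pdf)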
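To see why looping over every anchor matters without hitting the live site, here is a minimal offline sketch. It uses only the standard library's html.parser rather than BeautifulSoup, and the HTML snippet and the SectionLinkParser class are made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the real page: one target section, several anchors.
SAMPLE_HTML = """
<section class="news_content">
  <div><a href="../memorandum-circulars/circular-a/">A</a></div>
  <div><a href="../memorandum-circulars/circular-b/">B</a></div>
  <p><a href="https://www.privacy.gov.ph/circular-c/">C</a></p>
</section>
<section class="other"><a href="/ignored/">ignored</a></section>
"""


class SectionLinkParser(HTMLParser):
    """Collect the href of every <a> inside <section class="news_content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside the target section
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "section":
            classes = (attrs.get("class") or "").split()
            # Enter the target section, or track sections nested inside it.
            if self.depth or "news_content" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "section" and self.depth:
            self.depth -= 1


parser = SectionLinkParser()
parser.feed(SAMPLE_HTML)
hrefs = parser.hrefs
for href in hrefs:
    print(href)
```

All three anchors inside the section are collected, whereas taking only the first anchor of the first div would yield a single link, which matches the symptom in the question.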
Answer 1 (score: 1)
Try the code below:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.privacy.gov.ph/memorandum-circulars/')
prefix = 'https://www.privacy.gov.ph'
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.findAll("section", {"class": "news_content"})
for circulars in container:
    # href=True skips anchors that have no href attribute.
    for a in circulars.findAll('a', href=True):
        # Only prepend the prefix when the href is not already absolute.
        pdf = (prefix + a['href'].replace("..", "")) if prefix not in a['href'] else a['href']
        print(pdf)
Output:
https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-01-security-of-personal-data-in-government-agencies/
https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-02-data-sharing-agreements-involving-government-agencies/
https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-03-personal-data-breach-management/
https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-04-rules-of-procedure/
https://www.privacy.gov.ph/npc-circular-17-01-registration-data-processing-notifications-regarding-automated-decision-making/
https://www.privacy.gov.ph/wp-content/uploads/2017/08/NPC17-01_Appendix-1.pdf
https://www.privacy.gov.ph/npc-circular-no-18-01-rules-of-procedure-on-requests-for-advisory-opinions/
https://www.privacy.gov.ph/npc-circular-no-18-02-guidelines-on-compliance-checks/
https://www.privacy.gov.ph/npc-circular-no-18-03-rules-on-mediation-before-the-national-privacy-commission
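As an alternative to stripping ".." by hand, the standard library's urllib.parse.urljoin can resolve each href against the page URL. This is only a sketch: the sample hrefs below are assumed shapes (one relative, one already absolute), not values copied from the page, and absolutize is a hypothetical helper name.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.privacy.gov.ph/memorandum-circulars/"

# Assumed href shapes, for illustration only.
sample_hrefs = [
    "../wp-content/uploads/2017/08/NPC17-01_Appendix-1.pdf",
    "https://www.privacy.gov.ph/npc-circular-no-18-02-guidelines-on-compliance-checks/",
]


def absolutize(hrefs, base=PAGE_URL):
    """Resolve each href against the page URL; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


full_urls = absolutize(sample_hrefs)
for url in full_urls:
    print(url)
```

With urljoin, the same call handles both relative and absolute hrefs, so no prefix check or string replacement is needed inside the loop.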
Answer 2 (score: 0)
Try this solution using SimplifiedDoc; it can convert relative paths into full URLs for you.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
url = 'https://www.privacy.gov.ph/memorandum-circulars/'
page = requests.get(url)
doc = SimplifiedDoc(page.text)
container = doc.getElement('section',attr='class',value='news_content')
lstA = container.listA(url=url)
print([a.url for a in lstA])
Result:
['https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-01-security-of-personal-data-in-government-agencies/', 'https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-02-data-sharing-agreements-involving-government-agencies/', 'https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-03-personal-data-breach-management/', 'https://www.privacy.gov.ph/memorandum-circulars/npc-circular-16-04-rules-of-procedure/', 'https://www.privacy.gov.ph/npc-circular-17-01-registration-data-processing-notifications-regarding-automated-decision-making/', 'https://www.privacy.gov.ph/wp-content/uploads/2017/08/npc17-01_appendix-1.pdf', 'https://www.privacy.gov.ph/npc-circular-no-18-01-rules-of-procedure-on-requests-for-advisory-opinions/', 'https://www.privacy.gov.ph/npc-circular-no-18-02-guidelines-on-compliance-checks/', 'https://www.privacy.gov.ph/npc-circular-no-18-03-rules-on-mediation-before-the-national-privacy-commission']
You can find SimplifiedDoc examples here.