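Thanks to stackoverflow.com, I was able to write a program that scrapes the web links from any given web page. However, I need it to join the main URL onto any relative links it encounters. (Example: "http://www.google.com/sitemap" is fine, but "/sitemap" on its own is not.)
Here is my current code: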
from bs4 import BeautifulSoup as mySoup
from urllib.parse import urljoin as myJoin
from urllib.request import urlopen as myRequest
base_url = "https://www.census.gov/programs-surveys/popest.html"
html_page = myRequest(base_url)
raw_html = html_page.read()
page_soup = mySoup(raw_html, "html.parser")
html_page.close()
f = open("census4-3.csv", "w")
all_links = page_soup.find_all('a', href=True)
def clean_links(tags, base_url):
    cleaned_links = set()
    for tag in tags:
        link = tag.get('href')
        if link is None:
            continue
        full_url = myJoin(base_url, link)
        cleaned_links.add(full_url)
    return cleaned_links
cleaned_links = clean_links(all_links, base_url)
for link in cleaned_links:
    f.write(str(link) + '\n')
f.close()
print("The CSV file is saved to your computer.")
How and where would I add the following?
.append("http://www.google.com")
Answer 0 (score: 1)
You should save the base URL as base_url = 'https://www.census.gov'.
Then make the request like this:
html_page = myRequest(base_url + '/programs-surveys/popest.html')
If you want to get full_url, do this (it works when link is a root-relative path such as "/sitemap"):
full_url = base_url + link
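One caveat worth illustrating: plain concatenation only gives a valid URL when the href starts with "/". The sketch below compares it with urljoin from urllib.parse, which the question's code already imports. The href values and the helper name resolve_links are made up for illustration; they are not taken from the census page.

```python
from urllib.parse import urljoin

# Hypothetical sample values, for illustration only.
page_url = "https://www.census.gov/programs-surveys/popest.html"
site_root = "https://www.census.gov"

hrefs = [
    "/sitemap",                       # root-relative link
    "data.html",                      # relative to the current directory
    "http://www.google.com/sitemap",  # already absolute
]


def resolve_links(base, links):
    """Resolve each href against base; absolute URLs come back unchanged."""
    return [urljoin(base, href) for href in links]


# urljoin handles all three kinds of href.
resolved = resolve_links(page_url, hrefs)

# Plain concatenation is only correct for the root-relative href.
concatenated = [site_root + href for href in hrefs]

for href, joined, concat in zip(hrefs, resolved, concatenated):
    print(f"{href!r}: urljoin -> {joined} | concat -> {concat}")
```

With urljoin you can pass the full page URL as the base, so there is no need to split off the site root by hand, and links that are already absolute are left alone.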