I want to scrape a list of URLs from a website and then open them one by one. I can collect all of the URLs, but something goes wrong when I try to turn them into a list: when I print it, the console shows the list growing line by line, like this:
[url1, url2, url3]
[url1, url2, url3, url4]
[url1, url2, url3, url4, url5]
Here is my script:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time

driver = webdriver.Chrome()
my_url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"
driver.get(my_url)
time.sleep(3)

content = driver.page_source.encode('utf-8').strip()
page_soup = soup(content, "html.parser")

links = []
for link in page_soup.find_all('a', href=True):
    url = link['href']
    ai = str(url)
    links.append(ai)
    print(links)
Answer 0 (score: 0):
I rewrote your code a bit. First you load and scrape the main page to collect all of the links from the "href" attributes. After that, you simply loop over the scraped URLs to fetch the next pages.
There is also some junk among the "href" values that is not a URL, so you have to clean that out first.
I prefer to use requests for the GET requests:
http://docs.python-requests.org/en/master/
I hope it helps.
from bs4 import BeautifulSoup
import requests


def main():
    links = []
    url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"

    web_page = requests.get(url)
    soup = BeautifulSoup(web_page.content, "html.parser")

    # collect every href on the page
    a_tags = soup.find_all('a', href=True)
    for a in a_tags:
        links.append(a.get("href"))
    print(links)  # just to demonstrate that links are there

    # keep only the hrefs that are real URLs, drop the junk
    cleaned_list = []
    for link in links:
        if "http" in link:
            cleaned_list.append(link)
    print(cleaned_list)

    return cleaned_list


def load_pages_from_links(urls):
    user_agent = {'User-agent': 'Mozilla/5.0'}
    links = urls
    downloaded_pages = {}

    if len(links) == 0:
        return "There are no links."
    else:
        # fetch every cleaned URL and store its content by index
        for nr, link in enumerate(links):
            web_page = requests.get(link, headers=user_agent)
            downloaded_pages[nr] = web_page.content
        print(downloaded_pages)


if __name__ == "__main__":
    links = main()
    load_pages_from_links(links)
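The question also mentions opening each URL one by one in the Selenium browser rather than downloading it with requests. A minimal sketch of that variant, reusing the cleaned list returned by main() above; the helper name open_links_in_browser and the pause length are my own choices, not part of the answer's code:

from selenium import webdriver
import time


def open_links_in_browser(urls, delay=2):
    # open each scraped URL in the same Chrome window, one after another
    driver = webdriver.Chrome()
    for url in urls:
        driver.get(url)
        time.sleep(delay)  # arbitrary pause so each page has time to load
    driver.quit()


# usage, reusing the list built by main():
# open_links_in_browser(main())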