I am new to Python. With some help I have written code to scrape some data from a web page. However, with this code I can only scrape the first page of each link.
Currently, the code below lets me grab the links for each year's data (https://aviation-safety.net/database/dblist.php?Year=1949), but only from the first page of each year.
However, for some years the year's link has additional pages (e.g. page 2, page 3, page 4), such as https://aviation-safety.net/database/dblist.php?Year=1949&lang=&page=2 and https://aviation-safety.net/database/dblist.php?Year=1949&lang=&page=3.
I would like to know whether it is possible to also retrieve the record links from those additional pages of each year's data.
#get the additional links within each Year Link
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

main_url = "https://aviation-safety.net/database/"

def get_and_parse_url(main_url):
    result = requests.get(main_url)
    soup = BeautifulSoup(result.content, 'html.parser')
    data_table = [main_url + i['href'] for i in soup.select('[href*=Year]')]
    return data_table

with requests.Session() as s:
    data_table = get_and_parse_url(main_url)
    df = pd.DataFrame(data_table, columns=['url'])

    datatable2 = []  #create outside so can append to it
    for anker in df.url:
        result = s.get(anker, headers=headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable2.append(['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')])

    #flatten list of lists
    datatable2 = [i for sublist in datatable2 for i in sublist]
    df2 = pd.DataFrame(datatable2, columns=['add_url'])

for i in df2.add_url:
    print(i)
Any help is much appreciated, thank you!
Answer 0 (score: 2)
For each initial year page you can determine how many additional pages exist by collecting the child a tags inside the elements with class pagenumbers (using :nth-of-type to restrict to just one of the two pagination blocks so page numbers are not counted twice). Do that inside a list comprehension that builds the actual URLs of the additional pages, then gather the record links from those pages with an extra loop. At the time of writing this yields 22,629 distinct links.
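To make the :nth-of-type part concrete, here is a small stand-alone sketch of that selector against sample markup. The markup is only an assumed approximation of the page's structure (a pagenumbers block above and below the results table); the live page may differ in detail:

from bs4 import BeautifulSoup

# assumed sample markup only -- the live page may differ
sample = """
<div>
  <div class="pagenumbers"><a href="?page=2">2</a><a href="?page=3">3</a></div>
  <table>results</table>
  <div class="pagenumbers"><a href="?page=2">2</a><a href="?page=3">3</a></div>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')
# restrict to the second pagination block so each page number is counted once
print([a.text for a in soup.select('.pagenumbers:nth-of-type(2) a')])   # ['2', '3']

The full script then looks like this: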
import requests
from bs4 import BeautifulSoup as bs

base = 'https://aviation-safety.net/database/'
headers = {'User-Agent': 'Mozilla/5.0'}
inner_links = []

def get_soup(url):
    # uses the shared session s created in the with block below
    r = s.get(url, headers=headers)
    soup = bs(r.text, 'lxml')
    return soup

with requests.Session() as s:
    soup = get_soup('https://aviation-safety.net/database/')
    initial_links = [base + i['href'] for i in soup.select('[href*="Year="]')]

    for link in initial_links:
        # record links on the first page of the year
        soup = get_soup(link)
        inner_links += ['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')]
        # additional page numbers listed in the second pagination block (if any)
        pages = [f'{link}&lang=&page={i.text}' for i in soup.select('.pagenumbers:nth-of-type(2) a')]

        for page in pages:
            # record links on each additional page
            soup = get_soup(page)
            inner_links += ['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')]
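If you want the collected links back in a pandas DataFrame, as in your original code, a minimal follow-up sketch (it assumes the inner_links list built by the script above; drop_duplicates guards against any record that is listed on more than one page):

import pandas as pd

# assumes inner_links has been filled by the loop above
df2 = pd.DataFrame(inner_links, columns=['add_url']).drop_duplicates().reset_index(drop=True)
print(len(df2))
print(df2.head())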