"With some assistance I was able to extract a list of web links sorted by year (1919 through 2019). Based on these web links, I want to extract the table data."
"I am able to obtain the URLs for 1919-2019. However, I now need to get the additional links from each year link."
import pandas as pd
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable
datatable = getAndParseURL(mainurl)
#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)
#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)
#save the dataframe
yearlinks = df.to_csv('C:/Users/123/aviationsafetyyearlinks.csv')
#obtained list of URLs.
df = pd.read_csv('C:/Users/123/aviationsafetyyearlinks.csv')
ankers = df.url
for anker in ankers:
    result = requests.get(anker, headers=headers)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable2 = soup.find_all('a', href=True)
    print(datatable2)

links = []
for link in datatable2:
    if "id=" in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df2 = pd.DataFrame(links, columns=['addurl'])
print(df2)
"With this code I only get the additional links for 2019 and I am not sure why, even though datatable2 prints all of the HTML content from 1919 through 2019, including every additional link." "Any kind of help is greatly appreciated!"
Answer 0 (score: 1)
You are re-creating datatable2 each time inside the loop, so only the last value is kept. You want to create it outside the loop and append to it within the loop. I use CSS attribute = value selectors and list comprehensions for the URL filtering.
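As a minimal standalone sketch of that pattern (made-up placeholder strings, not the site's data):

pages = ['1919 page', '1920 page', '2019 page']   # stand-ins for the year pages

# re-assigning inside the loop: only the last iteration's result survives
for page in pages:
    year_links = [page + ' link']
print(year_links)                                  # ['2019 page link']

# creating the list outside the loop and appending keeps every year's result
all_links = []
for page in pages:
    all_links.append([page + ' link'])
all_links = [l for sub in all_links for l in sub]  # flatten, as in the code further down
print(all_links)                                   # one link per year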
You can also do some variable/function renaming, use a Session to re-use the connection, and tidy up a few lines of code. Change the href check to an attribute value containing database/record so that only the applicable links are picked up, and change the prefix that is appended to the final URLs.
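For reference, here is a tiny self-contained illustration of that attribute*=value selector, using a made-up HTML fragment rather than the real page:

from bs4 import BeautifulSoup

snippet = '<a href="/database/record.php?id=19190802-0">record</a><a href="/database/dblist.php?Year=1919">year</a>'
soup = BeautifulSoup(snippet, 'html.parser')
# [href*="database/record"] keeps only anchors whose href contains that substring
print([a['href'] for a in soup.select('[href*="database/record"]')])
# ['/database/record.php?id=19190802-0']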
import pandas as pd
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
main_url = "https://aviation-safety.net/database/"
def get_and_parse_url(main_url):
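    # note: this uses the Session object `s` created in the with-block further down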
    result = s.get(main_url)
    soup = BeautifulSoup(result.content, 'html.parser')
    data_table = [main_url + i['href'] for i in soup.select('[href*=Year]')]
    return data_table
with requests.Session() as s:
    data_table = get_and_parse_url(main_url)
    df = pd.DataFrame(data_table, columns=['url'])
    datatable2 = []  #create outside so can append to it

    for anker in df.url:
        result = s.get(anker, headers=headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable2.append(['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')])
#flatten list of lists
datatable2 = [i for sublist in datatable2 for i in sublist]
df2 = pd.DataFrame(datatable2, columns=['add_url'])
print(df2)
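The question also asks for the table data itself; as a rough follow-up sketch (not part of the answer above, and assuming the record pages expose their details as HTML <table> elements and that a parser such as lxml is installed), pandas.read_html can be pointed at each collected URL:

import pandas as pd
import requests

tables = []
with requests.Session() as s:
    for url in datatable2[:5]:                 # sample a few of the collected record URLs first
        page = s.get(url, headers=headers)     # re-use the headers defined above
        # read_html returns one DataFrame per <table> element found in the page
        tables.extend(pd.read_html(page.text))

print(len(tables))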