I have a list of URLs, but I can only extract data from one of them

Date: 2019-08-18 15:10:14

Tags: python-3.x web-scraping beautifulsoup

With some assistance, I was able to extract a list of web links sorted by year (from 1919 to 2019). Based on these web links, I want to extract the table data.

I am able to get the URLs for 1919-2019. However, I need to get the additional links from each year link.

import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}


#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)


#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])

df.head(10)


#save the dataframe

yearlinks = df.to_csv('C:/Users/123/aviationsafetyyearlinks.csv')

#obtained list of URLs.

df = pd.read_csv('C:/Users/123/aviationsafetyyearlinks.csv')

ankers = df.url
for anker in ankers:
    result = requests.get(anker, headers = headers)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable2 = soup.find_all('a', href = True)

    print(datatable2)


links = []
for link in datatable2:
    if "id=" in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df2 = pd.DataFrame(links, columns=['addurl'])

print(df2)

Based on the code, I am only able to get the single link for 2019. I am not sure why, but datatable2 prints all of the HTML content from 1919 to 2019 along with every additional link. Any kind of help is greatly appreciated, thank you!

1 Answer:

Answer 0 (score: 1)

You re-create datatable2 inside the loop each time, so only the last value is kept. You want to create it outside the loop and append to it inside the loop. I use a CSS attribute = value selector and a list comprehension for the URL filtering.
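For illustration only, here is a minimal, self-contained sketch of that pattern; the HTML snippets and URLs below are placeholders rather than the real year pages:

from bs4 import BeautifulSoup

# two tiny HTML snippets standing in for two year pages
pages = [
    '<a href="/database/record.php?id=19190802-0">record 1</a>',
    '<a href="/database/record.php?id=20190204-0">record 2</a>',
]

all_links = []  # created once, outside the loop, so nothing is overwritten
for html in pages:
    soup = BeautifulSoup(html, 'html.parser')
    # CSS attribute*=value selector: every href containing "database/record"
    all_links.extend('https://aviation-safety.net' + a['href']
                     for a in soup.select('[href*="database/record"]'))

print(all_links)  # results from every iteration are kept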

You could also do some variable/function renaming, use a Session to re-use the connection, and tidy up a few lines. Change the href check to an attribute value containing database/record so that only the applicable links are picked up, and change the prefix prepended to the final URLs.

import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
main_url = "https://aviation-safety.net/database/"

def get_and_parse_url(main_url):
    # uses the Session object `s` created in the `with` block below
    result = s.get(main_url)
    soup = BeautifulSoup(result.content, 'html.parser')
    # keep only the anchors whose href contains "Year" and prepend the base URL
    data_table = [main_url + i['href'] for i in soup.select('[href*=Year]')]
    return data_table

with requests.Session() as s:
    data_table = get_and_parse_url(main_url)
    df = pd.DataFrame(data_table, columns=['url'])
    datatable2 = []  # create outside the loop so results from every year accumulate

    for anker in df.url:
        result = s.get(anker, headers = headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable2.append(['https://aviation-safety.net' + i['href'] for i in soup.select('[href*="database/record"]')])

#flatten list of lists
datatable2 = [i for sublist in datatable2 for i in sublist]
df2 = pd.DataFrame(datatable2 , columns=['add_url'])
print(df2)
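As for the table data mentioned in the question, one possible follow-up (not part of the answer above) is to parse each year page with pandas. This is only a sketch: the year URL is a hypothetical example, it assumes the year pages contain a plain HTML table, and pd.read_html requires lxml or html5lib to be installed.

import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as s:
    # hypothetical single year page; substitute one of the URLs collected above
    result = s.get('https://aviation-safety.net/database/dblist.php?Year=1919',
                   headers=headers)
    # read_html returns one DataFrame per <table> element found in the page
    tables = pd.read_html(result.text)
    if tables:
        print(tables[0].head())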