How to scrape the additional pages of a web page

Time: 2019-09-23 07:57:45

Tags: python pandas beautifulsoup

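With some help from the community, I managed to scrape some information from a web page. However, I am having trouble scraping the information on the site's additional pages.

The code shown below retrieves the following columns for each year ("date", "type", "registration", "operator", "fat.", "location", "cat") from the site, covering 1919 through 2019. An example of a per-year URL is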

https://aviation-safety.net/database/dblist.php?Year=1946

However, I realized that each year's URL has several additional pages, for example

https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=2
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=3
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=4
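The pattern in these URLs can be written down as a small helper. This is only a sketch, assuming the `page` query parameter is the only part that changes between pages of a given year; `BASE_URL` and `page_url` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Listing endpoint taken from the example URLs in the question.
BASE_URL = "https://aviation-safety.net/database/dblist.php"


def page_url(year, page=1):
    """Build the listing URL for one year and one page (hypothetical helper)."""
    if page == 1:
        # The first page is the plain per-year URL.
        return f"{BASE_URL}?{urlencode({'Year': year})}"
    # Later pages follow the Year=...&lang=&page=N pattern.
    return f"{BASE_URL}?{urlencode({'Year': year, 'lang': '', 'page': page})}"


# Pages 1 to 4 for 1946; the page count here is just an example value.
urls = [page_url(1946, p) for p in range(1, 5)]
for u in urls:
    print(u)
```

How many pages a given year actually has is not known in advance, so the upper bound still has to come from the site itself.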

How can I scrape the additional pages for each year?

import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}


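#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    # send the same browser-like headers used for the per-year requests below
    result = requests.get(mainurl, headers=headers)
    soup = BeautifulSoup(result.content, 'html.parser')
    # collect every anchor tag that has an href attribute
    datatable = soup.find_all('a', href = True)
    return datatable

datatable = getAndParseURL(mainurl)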

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']

        links.append(mainurl + url)


#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])

df.head(10)



#create an empty list for the tables and an empty list to store urls that didn't pull a table
tables = []
no_table = []
#Loop through the URLs retrieved previously and collect each table
for x in df['url']:
    try:
        html = requests.get(x, headers=headers).text   # <----- added headers
        table = pd.read_html(html)[0]    # <---- pandas parses the table tags and returns a list of dataframes; we want the dataframe in position 0
        tables.append(table)
        print ('Processed: %s' %x)
    except Exception:
        print ('No table found: %s' %x)
        no_table.append(x)

#DataFrame.append was removed in pandas 2.0, so combine the tables with pd.concat
results_df = pd.concat(tables, sort=True, ignore_index=True)
results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]

1 Answer:

Answer 0: (score: 1)

You can use BeautifulSoup to check the tag that contains the page numbers, and it looks like you can then iterate through those. There may be a better way to do it, but I just added another try/except in there to handle whether additional pages are found.
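A minimal sketch of that idea follows. The exact tag holding the page numbers is not named here, so this version makes an assumption instead: it scans every link for a `page=N` query parameter and takes the largest value. It also falls back to a default of 1 rather than using a try/except. The helper names `find_last_page` and `build_page_urls`, and the sample HTML, are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Matches the page=N query parameter seen in the question's example URLs.
PAGE_RE = re.compile(r"[?&]page=(\d+)")


def find_last_page(html):
    """Return the highest page number linked from a listing page, or 1 if none."""
    soup = BeautifulSoup(html, "html.parser")
    pages = [
        int(match.group(1))
        for a in soup.find_all("a", href=True)
        if (match := PAGE_RE.search(a["href"]))
    ]
    return max(pages, default=1)


def build_page_urls(year_url, html):
    """List every page URL for one year, starting with the year URL itself."""
    last = find_last_page(html)
    return [year_url] + [f"{year_url}&lang=&page={n}" for n in range(2, last + 1)]


# Made-up HTML standing in for a real listing page (an assumption, not the site's markup).
sample = """
<div>
  <a href="?Year=1946&lang=&page=2">2</a>
  <a href="?Year=1946&lang=&page=3">3</a>
  <a href="?Year=1946&lang=&page=4">4</a>
</div>
"""

year_url = "https://aviation-safety.net/database/dblist.php?Year=1946"
for url in build_page_urls(year_url, sample):
    print(url)
```

Each URL returned this way can then go through the same `requests.get` and `pd.read_html` steps as in the question's loop, so a page without a table is still caught by the existing try/except.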

<script>
import axios from "axios";
import DataTable from "primevue/datatable";
import Column from "primevue/column";
import Button from "primevue/button";
import Dropdown from "primevue/dropdown";

export default {
  name: "UltimasComunicaciones",
  components: {
    DataTable,
    Column,
    Button,
    Dropdown
  },