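With some help from the community, I was able to scrape some information from a web page. However, I am having trouble scraping the same information from the site's other pages.
The code shown below retrieves the following columns ("date", "type", "registration", "operator", "fat.", "location", "cat") from each year's page, for the years 1919 through 2019. An example of a per-year URL is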
https://aviation-safety.net/database/dblist.php?Year=1946
However, I realized that each year's URL has a number of additional pages, for example
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=2
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=3
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=4
How can I scrape these additional pages for each year?
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

# start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

# go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

# check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

# create an empty list of tables and an empty list to store urls that didn't pull a table
tables = []
no_table = []

# loop through the URLs retrieved previously and collect each table
for x in df['url']:
    try:
        html = requests.get(x, headers=headers).text  # added headers
        # read_html parses the table tags and returns a list of dataframes; we want the one in position 0
        table = pd.read_html(StringIO(html))[0]
        tables.append(table)
        print('Processed: %s' % x)
    except (requests.RequestException, ValueError):
        # read_html raises ValueError when the page contains no table
        print('No table found: %s' % x)
        no_table.append(x)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
results_df = pd.concat(tables, sort=True, ignore_index=True)
results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]
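For reference, the example URLs above suggest that the extra pages differ from the plain year URL only by the `lang` and `page` query parameters. The following is a minimal sketch of building those URLs, assuming that pattern holds for every year. The names `BASE_URL`, `build_page_url`, `build_page_urls` and `last_page` are hypothetical, and the number of pages for a given year still has to be found separately.

```python
from urllib.parse import urlencode

# Listing endpoint taken from the example URLs in the question.
BASE_URL = "https://aviation-safety.net/database/dblist.php"


def build_page_url(year, page=None):
    """Return the listing URL for a year, optionally for a specific page."""
    params = {"Year": year}
    if page is not None:
        # Mirror the query string seen in the examples: Year=...&lang=&page=N
        params["lang"] = ""
        params["page"] = page
    return BASE_URL + "?" + urlencode(params)


def build_page_urls(year, last_page):
    """Return the URLs for pages 2..last_page (page 1 is the plain year URL)."""
    return [build_page_url(year, page) for page in range(2, last_page + 1)]
```

Calling `build_page_urls(1946, 4)` would produce the three example URLs listed in the question.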
Answer 0 (score: 1)
You can use BeautifulSoup to check the tag that contains the number of pages, and it looks like you can then iterate over those pages. There may be a better way to do it, but I simply added another try/except to handle the case where additional pages are found.
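A minimal sketch of the page-discovery step described above, not the answerer's exact code. It assumes the pagination links are ordinary `<a>` elements whose `href` carries a `page` query parameter, as in the example URLs from the question. The site's real markup is not shown here, and the function name `find_page_urls` is hypothetical.

```python
from urllib.parse import parse_qs, urljoin, urlparse

from bs4 import BeautifulSoup


def find_page_urls(html, page_url):
    """Return absolute URLs of the extra pages linked from a year page.

    Assumption: pagination links are <a> tags whose href has a numeric
    `page` query parameter. Links without one are ignored.
    """
    soup = BeautifulSoup(html, "html.parser")
    pages = {}
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs (including query-only ones) against the page URL.
        absolute = urljoin(page_url, anchor["href"])
        query = parse_qs(urlparse(absolute).query)
        page = query.get("page", [""])[0]
        if page.isdigit():
            # Keying by page number removes duplicate links to the same page.
            pages[int(page)] = absolute
    return [pages[number] for number in sorted(pages)]
```

Each URL returned can then be fetched with `requests.get(url, headers=headers)` and passed to `pd.read_html` in the same way as the first page, inside its own try/except so that a missing table on one page does not stop the loop.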
<script>
import axios from "axios";
import DataTable from "primevue/datatable";
import Column from "primevue/column";
import Button from "primevue/button";
import Dropdown from "primevue/dropdown";
export default {
name: "UltimasComunicaciones",
components: {
DataTable,
Column,
Button,
Dropdown
},