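Hi, I'm new to Python and web scraping. I've retrieved a list of URLs and want to pull the data from the table on each individual link, but I've run into some problems.
This is what I've tried so far: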
# import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

# start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

# go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

# check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

# create empty array
accidentdata = []

# loop through the URLs retrieved previously
for x in df['url']:
    html = requests.get(x).text
    soup = BeautifulSoup(html, "html.parser")
    # identify table we want to scrape
    accidentdata_table = soup.find('table', {"class": "list"})
    # try clause to skip any other tables
    try:
        # loop through table, grab each of the 9 columns in the accident data
        for row in accidentdata_table.find_all('tr'):
            cols = row.find_all('td')
            if len(cols) == 9:
                accidentdata.append((
                    x,
                    cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(),
                    cols[3].text.strip(), cols[4].text.strip(), cols[5].text.strip(),
                    cols[6].text.strip(), cols[7].text.strip(), cols[8].text.strip(),
                ))
    except: pass

# convert output to new array, check length
accidentdata_array = np.asarray(accidentdata)
len(accidentdata_array)

# convert new array to dataframe
df = pd.DataFrame(accidentdata_array)
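The output of len(accidentdata_array) is 0. The code seems to scrape, but I'm not getting the results I want.
I'd like to get the data from the following columns: date; type; registration; operator; fatalities; location; category.
Is there something wrong with the code? Any help is much appreciated, thank you!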
Answer 0 (score: 3)
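I made a few modifications, but the main issue is that you need to add a user agent to the requests headers. The changes are:
- added a user-agent entry to the headers passed to requests.get()
- used pd.read_html() on its own to read the tables, since it parses the <table> tags for you (it can use bs4 under the hood)
Code: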
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
import requests

# browser-like User-Agent; adding this header is the key change
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

# start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

# go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

# check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

# one dataframe per page, plus a list of the urls that didn't yield a table
tables = []
no_table = []

# loop through the URLs retrieved previously and collect each table
for x in df['url']:
    try:
        html = requests.get(x, headers=headers).text  # <----- added headers
        # pandas parses the <table> tags and returns a list of dataframes; we want position 0.
        # StringIO avoids the deprecation warning newer pandas emits for literal HTML strings.
        table = pd.read_html(StringIO(html))[0]
        tables.append(table)
        print('Processed: %s' % x)
    except Exception:
        print('No table found: %s' % x)
        no_table.append(x)

# DataFrame.append was removed in pandas 2.0, so combine the pages with pd.concat
results_df = pd.concat(tables, sort=True, ignore_index=True)
results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]
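A side note on the link-building step: mainurl + url works here because the hrefs on the index page appear to be relative to /database/ (the resulting URLs look like .../database/dblist.php?Year=1920). If the hrefs could also be absolute paths or full URLs, a slightly more defensive sketch is to use urllib.parse.urljoin from the standard library. The helper name build_year_links and the sample hrefs below are made up for illustration and were not scraped from the site.

```python
from urllib.parse import urljoin

MAIN_URL = "https://aviation-safety.net/database/"

def build_year_links(hrefs, base=MAIN_URL):
    """Return absolute URLs for every href that mentions 'Year' (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs if "Year" in href]

# Illustrative hrefs only; the real page may use a different mix
sample_hrefs = [
    "dblist.php?Year=1920",                # relative to /database/
    "/database/dblist.php?Year=1921",      # absolute path on the same host
    "https://aviation-safety.net/about/",  # no 'Year' in it, so it is skipped
]

year_links = build_year_links(sample_hrefs)
print(year_links)
```

With plain concatenation the second sample href would have produced a doubled /database/ segment, which is the case urljoin guards against.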
Output from the revised script:
print (no_table)
['https://aviation-safety.net/database/dblist.php?Year=1920']
print (results_df)
date type ... location cat
0 date unk. Antonov An-12B ... NaN U1
1 date unk. Antonov An-12B ... NaN U1
2 date unk. Antonov An-12B ... NaN U1
3 date unk. Antonov An-12BK ... Tiksi Airpor... A1
4 date unk. Antonov An-12BP ... Massawa Airp... A1
5 date unk. Antonov An-12BP ... NaN U1
6 date unk. Antonov An-2 ... unknown A1
7 date unk. Antonov An-2 ... Chita region A2
8 date unk. Antonov An-24B ... NaN A1
9 date unk. Antonov An-26 ... Belgorod Air... A1
10 date unk. Antonov An-26 ... Wadi Bu al H... A1
11 date unk. Antonov An-26 ... NaN A1
12 date unk. Antonov An-26 ... Orenburg Air... O1
13 date unk. Antonov An-2R ... NaN U1
14 date unk. Antonov An-2R ... Mielec O1
15 date unk. Antonov An-32 ... Kalaikunda A... A1
16 date unk. Antonov An-32A ... NaN A1
17 date unk. Avia 14 ... Sofia-Vrazhd... O1
18 date unk. BN-2A Islander ... NaN U1
19 date unk. BN-2A Islander ... NaN U1
20 date unk. BN-2A Islander ... Nassau Inter... A1
21 date unk. BN-2A Islander ... NaN U1
22 date unk. BN-2A-20 Islander ... Charles Prin... U1
23 date unk. BN-2A-21 Islander ... NaN U1
24 date unk. BN-2A-21 Islander ... NaN U1
25 date unk. BN-2A-21 Islander ... NaN U1
26 date unk. BN-2A-21 Islander ... NaN U1
27 date unk. BN-2A-26 Islander ... Paphos Inter... U1
28 date unk. BN-2A-8 Islander ... Toluca ? U1
29 date unk. BN-2A-8 Islander ... NaN U1
... ... ... ... ..
8468 19-JUN-2019 Antonov An-124-100 ... Tripoli Inte... C1
8469 20-JUN-2019 Antonov An-2 ... near Rodina villa... A1
8470 21-JUN-2019 Basler Turbo 67 (DC-3T) ... near Fort Hope Ai... A2
8471 23-JUN-2019 Antonov An-2 ... near Mlyny, Polta... A1
8472 24-JUN-2019 Hawker Siddeley HS-125-400 ... Parque Nacio... O1
8473 27-JUN-2019 Antonov An-24RV ... Nizhneangars... A1
8474 27-JUN-2019 BAe 3212 Jetstream 31 ... Canaima Airp... A1
8475 28-JUN-2019 Saab 340A ... Nassau-Lynde... O2
8476 29-JUN-2019 Cessna 208B Grand Caravan ... Plant City-B... A2
8477 30-JUN-2019 Beech B300 King Air 350i ... Dallas-Addis... A1
8478 01-JUL-2019 Boeing 737-85R ... Mumbai-Chhat... A2
8479 08-JUL-2019 Airbus A320-214 ... Tripoli-Miti... C2
8480 08-JUL-2019 Cessna 208B Grand Caravan ... Bethel Airpo... A1
8481 08-JUL-2019 Canadair CL-415 ... near Roberval Air... A2
8482 09-JUL-2019 Airbus A320-214 (WL) ... Amsterdam-Sc... A2
8483 09-JUL-2019 Boeing 737-8K2 (WL) ... Amsterdam-Sc... A2
8484 09-JUL-2019 Antonov An-2 ... near Raduga, Novo... A1
8485 13-JUL-2019 Beech B200 Super King Air ... Graham Creek... C1
8486 16-JUL-2019 Antonov An-2 ... Novoshchedri... A1
8487 17-JUL-2019 Cessna 550 Citation II ... Mesquite Mun... A1
8488 19-JUL-2019 DHC-8-402Q Dash 8 ... Edmonton Int... A2
8489 20-JUL-2019 ATR 42-500 ... Gilgit Airpo... A2
8490 23-JUL-2019 Boeing 737-36N (WL) ... Lagos-Murtal... A2
8491 25-JUL-2019 Ilyushin Il-76TD ... Al Jufra Air... C1
8492 25-JUL-2019 Ilyushin Il-76TD ... Al Jufra Air... C1
8493 26-JUL-2019 Cessna 208 Caravan 675 ... Addenbroke I... A1
8494 27-JUL-2019 Swearingen SA227-AC Metro III ... El Paso Inte... A2
8495 30-JUL-2019 Beech B300 King Air 350i ... Mora Kalu, R... A1
8496 30-JUL-2019 Antonov An-72P ... near Grand Batanga A1
8497 01-AUG-2019 Douglas C-118A Liftmaster (DC-6A) ... Candle 2 Air... A2
[8498 rows x 7 columns]
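As for why the original loop ended with a length of 0 instead of an error: presumably, without the user agent the response did not contain the expected table, so soup.find() returned None (that is what it returns when nothing matches), and the bare except: pass then swallowed the AttributeError raised by calling find_all on None. The sketch below only simulates that situation with None standing in for the missing table; the function names rows_silent and rows_checked are invented for illustration.

```python
def rows_silent(table):
    """The question's pattern: every error is swallowed, so a missing table yields []."""
    rows = []
    try:
        for row in table.find_all('tr'):
            rows.append(row)
    except:  # bare except kept on purpose to mirror the question's code
        pass
    return rows

def rows_checked(table, url):
    """Same loop, but a missing table is reported instead of hidden."""
    if table is None:
        raise LookupError(f"no table found at {url}")
    return list(table.find_all('tr'))

# Simulate soup.find() matching nothing
missing_table = None

silent_result = rows_silent(missing_table)
print(silent_result)

try:
    rows_checked(missing_table, "https://aviation-safety.net/database/dblist.php?Year=1920")
except LookupError as exc:
    error_message = str(exc)
    print(error_message)
```

Catching a specific exception and printing or logging the URL, as the answer's no_table list does, makes this kind of failure visible on the first run.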