使用URL列表,如何从每个URL中的HTML表中检索数据

时间:2019-08-13 16:00:18

标签: python-3.x web-scraping beautifulsoup urllib

“你好, 我对Python和网络爬虫很陌生。我已经获得了一个URL列表,并希望从每个单独链接中的表中检索数据,但是,遇到了一些问题。

“到目前为止,这是我尝试过的”

#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
   result = requests.get(mainurl)
   soup = BeautifulSoup(result.content, 'html.parser')
   datatable = soup.find_all('a', href = True)
   return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)


#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])

df.head(10)



#create empty array
accidentdata = []
#Loop through the URLs retrieved previously
for x in df['url']:
    html = requests.get(x).text
    soup = BeautifulSoup(html, "html.parser")
#identify table we want to scrape
    accidentdata_table = soup.find('table', {"class" : "list"})

#try clause to skip any other tables
try:
#loop through table, grab each of the 9 columns in the accident data
    for row in accidentdata_table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) == 9:
            accidentdata.append((x, cols[0].text.strip(), cols[1].text.strip(), cols[2].text.strip(), cols[3].text.strip(), cols[4].text.strip(), cols[5].text.strip(), cols[6].text.strip, cols[7].text.strip(), cols[8].text.strip()))
except: pass

#convert output to new array, check length
accidentdata_array = np.asarray(accidentdata)
len(accidentdata_array)

#convert new array to dataframe
df = pd.DataFrame(accidentdata_array)

“ len(accidentdata_array)的输出为0。该代码似乎可以抓取,但是我没有得到想要的结果”

我希望从以下几列中获取数据:日期;类型;注册;操作员死亡人数位置;类别。

代码是否有问题?非常感谢您的帮助,谢谢!”

1 个答案:

答案 0 :(得分:3)

进行了一些修改,但是主要问题是您需要在requests中添加用户代理。

  1. headers添加了user-agent参数
  2. 用熊猫来拉桌子,因为pd.read_html()仅使用bs4 在幕后解析<table>标签
  3. 在每个链接中都找不到一个表,因此我添加了一个列表来保存 找不到表,您可以调查这些表

代码:

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}


#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
   result = requests.get(mainurl)
   soup = BeautifulSoup(result.content, 'html.parser')
   datatable = soup.find_all('a', href = True)
   return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)


#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])

df.head(10)



#create empty datframe and empty list to store urls that didn't pull a table
results_df = pd.DataFrame()
no_table = []
#Loop through the URLs retrieved previously and append to results_df
for x in df['url']:
    try:
        html = requests.get(x, headers=headers).text   # <----- added headers
        table = pd.read_html(html)[0]    # <---- used pandas to read in the html and parse table tags. this will return a list of dataframes and want the dataframe in position 0

        results_df = results_df.append(table, sort=True).reset_index(drop=True)
        print ('Processed: %s' %x)
    except:
        print ('No table found: %s' %x)
        no_table.append(x)


results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]

输出:

print (no_table)
['https://aviation-safety.net/database/dblist.php?Year=1920']


print (results_df)
             date                               type  ...              location cat
0       date unk.                     Antonov An-12B  ...                   NaN  U1
1       date unk.                     Antonov An-12B  ...                   NaN  U1
2       date unk.                     Antonov An-12B  ...                   NaN  U1
3       date unk.                    Antonov An-12BK  ...       Tiksi Airpor...  A1
4       date unk.                    Antonov An-12BP  ...       Massawa Airp...  A1
5       date unk.                    Antonov An-12BP  ...                   NaN  U1
6       date unk.                       Antonov An-2  ...               unknown  A1
7       date unk.                       Antonov An-2  ...          Chita region  A2
8       date unk.                     Antonov An-24B  ...                   NaN  A1
9       date unk.                      Antonov An-26  ...       Belgorod Air...  A1
10      date unk.                      Antonov An-26  ...       Wadi Bu al H...  A1
11      date unk.                      Antonov An-26  ...                   NaN  A1
12      date unk.                      Antonov An-26  ...       Orenburg Air...  O1
13      date unk.                      Antonov An-2R  ...                   NaN  U1
14      date unk.                      Antonov An-2R  ...                Mielec  O1
15      date unk.                      Antonov An-32  ...       Kalaikunda A...  A1
16      date unk.                     Antonov An-32A  ...                   NaN  A1
17      date unk.                            Avia 14  ...       Sofia-Vrazhd...  O1
18      date unk.                     BN-2A Islander  ...                   NaN  U1
19      date unk.                     BN-2A Islander  ...                   NaN  U1
20      date unk.                     BN-2A Islander  ...       Nassau Inter...  A1
21      date unk.                     BN-2A Islander  ...                   NaN  U1
22      date unk.                  BN-2A-20 Islander  ...       Charles Prin...  U1
23      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
24      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
25      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
26      date unk.                  BN-2A-21 Islander  ...                   NaN  U1
27      date unk.                  BN-2A-26 Islander  ...       Paphos Inter...  U1
28      date unk.                   BN-2A-8 Islander  ...              Toluca ?  U1
29      date unk.                   BN-2A-8 Islander  ...                   NaN  U1
          ...                                ...  ...                   ...  ..
8468  19-JUN-2019                 Antonov An-124-100  ...       Tripoli Inte...  C1
8469  20-JUN-2019                       Antonov An-2  ...  near Rodina villa...  A1
8470  21-JUN-2019            Basler Turbo 67 (DC-3T)  ...  near Fort Hope Ai...  A2
8471  23-JUN-2019                       Antonov An-2  ...  near Mlyny, Polta...  A1
8472  24-JUN-2019         Hawker Siddeley HS-125-400  ...       Parque Nacio...  O1
8473  27-JUN-2019                    Antonov An-24RV  ...       Nizhneangars...  A1
8474  27-JUN-2019              BAe 3212 Jetstream 31  ...       Canaima Airp...  A1
8475  28-JUN-2019                          Saab 340A  ...       Nassau-Lynde...  O2
8476  29-JUN-2019          Cessna 208B Grand Caravan  ...       Plant City-B...  A2
8477  30-JUN-2019           Beech B300 King Air 350i  ...       Dallas-Addis...  A1
8478  01-JUL-2019                     Boeing 737-85R  ...       Mumbai-Chhat...  A2
8479  08-JUL-2019                    Airbus A320-214  ...       Tripoli-Miti...  C2
8480  08-JUL-2019          Cessna 208B Grand Caravan  ...       Bethel Airpo...  A1
8481  08-JUL-2019                    Canadair CL-415  ...  near Roberval Air...  A2
8482  09-JUL-2019               Airbus A320-214 (WL)  ...       Amsterdam-Sc...  A2
8483  09-JUL-2019                Boeing 737-8K2 (WL)  ...       Amsterdam-Sc...  A2
8484  09-JUL-2019                       Antonov An-2  ...  near Raduga, Novo...  A1
8485  13-JUL-2019          Beech B200 Super King Air  ...       Graham Creek...  C1
8486  16-JUL-2019                       Antonov An-2  ...       Novoshchedri...  A1
8487  17-JUL-2019             Cessna 550 Citation II  ...       Mesquite Mun...  A1
8488  19-JUL-2019                  DHC-8-402Q Dash 8  ...       Edmonton Int...  A2
8489  20-JUL-2019                         ATR 42-500  ...       Gilgit Airpo...  A2
8490  23-JUL-2019                Boeing 737-36N (WL)  ...       Lagos-Murtal...  A2
8491  25-JUL-2019                   Ilyushin Il-76TD  ...       Al Jufra Air...  C1
8492  25-JUL-2019                   Ilyushin Il-76TD  ...       Al Jufra Air...  C1
8493  26-JUL-2019             Cessna 208 Caravan 675  ...       Addenbroke I...  A1
8494  27-JUL-2019      Swearingen SA227-AC Metro III  ...       El Paso Inte...  A2
8495  30-JUL-2019           Beech B300 King Air 350i  ...       Mora Kalu, R...  A1
8496  30-JUL-2019                     Antonov An-72P  ...    near Grand Batanga  A1
8497  01-AUG-2019  Douglas C-118A Liftmaster (DC-6A)  ...       Candle 2 Air...  A2

[8498 rows x 7 columns]