Scrape an entire table from Wikipedia with BeautifulSoup and load it into pandas

Time: 2019-12-18 00:35:55

Tags: python pandas dataframe html-table beautifulsoup

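I am currently scraping the following wiki page: https://en.wikipedia.org/wiki/Cargo_aircraft. I am starting with just one table, for comparison. I am trying to scrape the entire table and output it to pandas. I know how to add the initial column, Aircraft, but I am having trouble scraping the columns from Volume onward.

How do I add all the rows of the table to a DataFrame, either row by row or column by column? I am not sure which approach is better.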



from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the Wikipedia page
page = requests.get('https://en.wikipedia.org/wiki/Cargo_aircraft')

# Create the BeautifulSoup object and locate the table
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', attrs={'class': 'wikitable sortable'})
tabledata = table.find_all('tbody')
links = table.find_all('a')

# Collect the aircraft names from the title attribute of each link
aircraft = []
for link in links:
    aircraft.append(link.get('title'))
print(aircraft)

# Build the DataFrame; only the Aircraft column is filled in so far
df = pd.DataFrame()
df['Aircraft'] = aircraft
df['Test'] = 'test'

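2 answers:

Answer 0 (score: 3)

Use pandas.read_html

  • Bypass BeautifulSoup and read the table directly into pandas.
  • read_html reads HTML tables into a list of DataFrame objects.
    • In this case, the table is at index [1].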
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]

# df view

                   Aircraft    Volume                  Payload             Cruise                  Range       Usage
0              Airbus A400M    270 m³    37,000 kg (82,000 lb)  780 km/h (420 kn)   6,390 km (3,450 nmi)    Military
1          Airbus A300-600F  391.4 m³   48,000 kg (106,000 lb)                  –   7,400 km (4,000 nmi)  Commercial
2          Airbus A330-200F    475 m³   70,000 kg (154,000 lb)  871 km/h (470 kn)   7,400 km (4,000 nmi)  Commercial
3             Airbus Beluga   1210 m³   47,000 kg (104,000 lb)                  –   4,632 km (2,500 nmi)  Commercial
4          Airbus Beluga XL   2615 m³   53,000 kg (117,000 lb)                  –   4,074 km (2,200 nmi)  Commercial
5            Antonov An-124   1028 m³  150,000 kg (331,000 lb)  800 km/h (430 kn)   5,400 km (2,900 nmi)        Both
6            Antonov An-225   1300 m³  250,000 kg (551,000 lb)  800 km/h (430 kn)  15,400 km (8,316 nmi)  Commercial
7               Boeing C-17         –   77,519 kg (170,900 lb)  830 km/h (450 kn)   4,482 km (2,420 nmi)    Military
8           Boeing 737-700C  107.6 m³    18,200 kg (40,000 lb)  931 km/h (503 kn)   5,330 km (2,880 nmi)  Commercial
9           Boeing 757-200F    239 m³    39,780 kg (87,700 lb)  955 km/h (516 kn)   5,834 km (3,150 nmi)  Commercial
10            Boeing 747-8F  854.5 m³  134,200 kg (295,900 lb)  908 km/h (490 kn)   8,288 km (4,475 nmi)  Commercial
11           Boeing 747 LCF   1840 m³   83,325 kg (183,700 lb)  878 km/h (474 kn)   7,800 km (4,200 nmi)  Commercial
12          Boeing 767-300F  438.2 m³   52,700 kg (116,200 lb)  850 km/h (461 kn)   6,025 km (3,225 nmi)  Commercial
13              Boeing 777F    653 m³  103,000 kg (227,000 lb)  896 km/h (484 kn)   9,070 km (4,900 nmi)  Commercial
14    Bombardier Dash 8-100     39 m³     4,700 kg (10,400 lb)  491 km/h (265 kn)   2,039 km (1,100 nmi)  Commercial
15             Lockheed C-5         –  122,470 kg (270,000 lb)           919 km/h   4,440 km (2,400 nmi)    Military
16           Lockheed C-130         –    20,400 kg (45,000 lb)  540 km/h (292 kn)   3,800 km (2,050 nmi)    Military
17         Douglas DC-10-30         –   77,000 kg (170,000 lb)  908 km/h (490 kn)   5,790 km (3,127 nmi)  Commercial
18  McDonnell Douglas MD-11    440 m³   91,670 kg (202,100 lb)  945 km/h (520 kn)   7,320 km (3,950 nmi)  Commercial
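
If you would rather stay with BeautifulSoup, as in the question, one possible approach is to walk every tr in the table and collect the text of its th/td cells. The following is only a minimal sketch, not something verified against the live page: table_to_rows and SAMPLE_HTML are hypothetical names introduced here, the sample markup is a cut-down stand-in for the Wikipedia table, and cells that use rowspan or colspan are not handled.

```python
from bs4 import BeautifulSoup


def table_to_rows(html, table_class="wikitable"):
    """Return (header, rows) for the first table carrying the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches a single CSS class, so "wikitable" also finds
    # tables declared as class="wikitable sortable".
    table = soup.find("table", class_=table_class)
    if table is None:
        raise ValueError(f"no table with class {table_class!r} found")

    header, rows = [], []
    for tr in table.find_all("tr"):
        # Take header and data cells in document order, as plain text.
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if not cells:
            continue
        # Treat the first row made up only of <th> cells as the header.
        if not header and tr.find("td") is None:
            header = cells
        else:
            rows.append(cells)
    return header, rows


# Hypothetical, cut-down stand-in for the table on the Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Aircraft</th><th>Volume</th><th>Usage</th></tr>
  <tr><td><a title="Airbus A400M">Airbus A400M</a></td><td>270 m³</td><td>Military</td></tr>
  <tr><td><a title="Boeing C-17">Boeing C-17</a></td><td>–</td><td>Military</td></tr>
</table>
"""

header, rows = table_to_rows(SAMPLE_HTML)
print(header)
print(rows)
```

For the real page, html would be page.text from the requests.get call in the question, and pd.DataFrame(rows, columns=header) should then give one DataFrame row per aircraft, assuming every row has as many cells as the header.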

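Answer 1 (score: 0)

You can try the following, which keeps only the leading metric value of each cell and converts it to a number: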

import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
# Keep the first token of each cell (the metric value); an en dash marks a missing value.
df['Volume'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Volume'].str.split()]).astype(float)
# Payload and Range have no missing values in this table, so they can be cast to int
# once the thousands separators are removed.
df['Payload'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Payload'].str.split()]).astype(int)
df['Cruise'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Cruise'].str.split()]).astype(float)
df['Range'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Range'].str.split()]).astype(int)

Result:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 6 columns):
Aircraft    19 non-null object
Volume      15 non-null float64
Payload     19 non-null int64
Cruise      16 non-null float64
Range       19 non-null int64
Usage       19 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 1.0+ KB

print(df)

                   Aircraft  Volume  Payload  Cruise  Range       Usage
0              Airbus A400M   270.0    37000   780.0   6390    Military
1          Airbus A300-600F   391.4    48000     NaN   7400  Commercial
2          Airbus A330-200F   475.0    70000   871.0   7400  Commercial
3             Airbus Beluga  1210.0    47000     NaN   4632  Commercial
4          Airbus Beluga XL  2615.0    53000     NaN   4074  Commercial
5            Antonov An-124  1028.0   150000   800.0   5400        Both
6            Antonov An-225  1300.0   250000   800.0  15400  Commercial
7               Boeing C-17     NaN    77519   830.0   4482    Military
8           Boeing 737-700C   107.6    18200   931.0   5330  Commercial
9           Boeing 757-200F   239.0    39780   955.0   5834  Commercial
10            Boeing 747-8F   854.5   134200   908.0   8288  Commercial
11           Boeing 747 LCF  1840.0    83325   878.0   7800  Commercial
12          Boeing 767-300F   438.2    52700   850.0   6025  Commercial
13              Boeing 777F   653.0   103000   896.0   9070  Commercial
14    Bombardier Dash 8-100    39.0     4700   491.0   2039  Commercial
15             Lockheed C-5     NaN   122470   919.0   4440    Military
16           Lockheed C-130     NaN    20400   540.0   3800    Military
17         Douglas DC-10-30     NaN    77000   908.0   5790  Commercial
18  McDonnell Douglas MD-11   440.0    91670   945.0   7320  Commercial
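
The four conversions in this answer repeat the same "take the first value, drop the commas, treat the dash as missing" logic. As a rough sketch, that logic could be pulled into one helper; leading_number is a hypothetical name introduced here, and the helper always returns a float (or None), which is a slightly different choice from the int casts used above.

```python
import re

# First run of ASCII digits, allowing thousands separators and a decimal part.
_NUMBER = re.compile(r"[0-9][0-9,]*(?:\.[0-9]+)?")


def leading_number(cell):
    """Return the first number in a cell such as '37,000 kg (82,000 lb)', or None."""
    if not isinstance(cell, str):
        # Missing cells (for example NaN) are passed through as None.
        return None
    match = _NUMBER.search(cell)
    if match is None:
        # Cells holding only a dash contain no digits at all.
        return None
    return float(match.group().replace(",", ""))


print(leading_number("37,000 kg (82,000 lb)"))
print(leading_number("391.4 m³"))
print(leading_number("–"))
```

It could then be applied per column, for example df['Volume'].map(leading_number), in place of each list comprehension.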