Unable to split a web-scraped table into rows

Time: 2018-03-14 18:25:45

Tags: python pandas dataframe web-scraping beautifulsoup

I used BeautifulSoup to pull a table of Tour de France winners from Wikipedia. What comes back looks like a dataset, but I can't seem to separate it into rows.

First, here is what I did to grab the page and the table:

import requests
response = requests.get("Https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
content = response.content

from bs4 import BeatifulSoup
parser = BeautifulSoup(content, 'html.parser')

# I know it's the second table on the page, so grab it as such
winners_table = parser.find_all('table')[1]

import pandas as pd
data = pd.read_html(str(winners_table), flavor = 'html5lib')

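Note that I'm using html5lib here because PyCharm tells me lxml is not available, even though it is definitely installed. When I print the table, it shows up as a table with 116 rows and 9 columns, but it doesn't seem to be split into rows. It looks like this: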

[           0              1  \
0       Year        Country   
1       1903         France   
2       1904         France   
3       1905         France   
4       1906         France   
5       1907         France   
6       1908         France   
7       1909     Luxembourg   
8       1910         France   
9       1911         France   
10      1912        Belgium   
11      1913        Belgium   
12      1914        Belgium   
13      1915    World War I   
14      1916            NaN   
15      1917            NaN   
16      1918            NaN   
17      1919        Belgium   
18      1920        Belgium   
19      1921        Belgium   
20      1922        Belgium   
21      1923         France   
22      1924          Italy   
23      1925          Italy   
24      1926        Belgium   
25      1927     Luxembourg   
26      1928     Luxembourg   
27      1929        Belgium   
28      1930         France   
29      1931         France   
..       ...            ...   
86      1988          Spain   
87      1989  United States   
88      1990  United States   
89      1991          Spain   
90      1992          Spain   
91      1993          Spain   
92      1994          Spain   
93      1995          Spain   
94      1996        Denmark   
95      1997        Germany   
96      1998          Italy   
97   1999[B]  United States   
98   2000[B]  United States   
99   2001[B]  United States   
100  2002[B]  United States   
101  2003[B]  United States   
102  2004[B]  United States   
103  2005[B]  United States   
104     2006          Spain   
105     2007          Spain   
106     2008          Spain   
107     2009          Spain   
108     2010     Luxembourg   
109     2011      Australia   
110     2012  Great Britain   
111     2013  Great Britain   
112     2014          Italy   
113     2015  Great Britain   
114     2016  Great Britain   
115     2017  Great Britain   

                                                     2  \
0                                              Cyclist   
1                          Garin, MauriceMaurice Garin   
2    Garin, MauriceMaurice Garin Cornet, HenriHenri...   
3                  Trousselier, LouisLouis Trousselier   
4                            Pottier, RenéRené Pottier   
5              Petit-Breton, LucienLucien Petit-Breton   
6              Petit-Breton, LucienLucien Petit-Breton   
7                        Faber, FrançoisFrançois Faber   
8                          Lapize, OctaveOctave Lapize   
9                    Garrigou, GustaveGustave Garrigou   
10                         Defraye, OdileOdile Defraye   
11                         Thys, PhilippePhilippe Thys   
12                         Thys, PhilippePhilippe Thys   
13                                                 NaN   
14                                                 NaN   
15                                                 NaN   
16                                                 NaN   
17                         Lambot, FirminFirmin Lambot   
18                         Thys, PhilippePhilippe Thys   
19                             Scieur, LéonLéon Scieur   
20                         Lambot, FirminFirmin Lambot   
21                     Pélissier, HenriHenri Pélissier   
22               Bottecchia, OttavioOttavio Bottecchia   
23               Bottecchia, OttavioOttavio Bottecchia   
24                         Buysse, LucienLucien Buysse   
25                       Frantz, NicolasNicolas Frantz   
26                       Frantz, NicolasNicolas Frantz   
27                   De Waele, MauriceMaurice De Waele   
28                           Leducq, AndréAndré Leducq   
29                         Magne, AntoninAntonin Magne   
..                                                 ...   
86                         Delgado, PedroPedro Delgado   
87                             LeMond, GregGreg LeMond   
88                             LeMond, GregGreg LeMond   
89                     Indurain, MiguelMiguel Indurain   
90                     Indurain, MiguelMiguel Indurain   
91                     Indurain, MiguelMiguel Indurain   
92                     Indurain, MiguelMiguel Indurain   
93                     Indurain, MiguelMiguel Indurain   
94                          Riis, BjarneBjarne Riis[A]   
95                            Ullrich, JanJan Ullrich#   
96                         Pantani, MarcoMarco Pantani   
97                     Armstrong, LanceLance Armstrong   
98                     Armstrong, LanceLance Armstrong   
99                     Armstrong, LanceLance Armstrong   
100                    Armstrong, LanceLance Armstrong   
101                    Armstrong, LanceLance Armstrong   
102                    Armstrong, LanceLance Armstrong   
103                    Armstrong, LanceLance Armstrong   
104  Landis, FloydFloyd Landis Pereiro, ÓscarÓscar ...   
105                 Contador, AlbertoAlberto Contador#   
106                       Sastre, CarlosCarlos Sastre*   
107                  Contador, AlbertoAlberto Contador   
108  Contador, AlbertoAlberto Contador Schleck, And...   
109                            Evans, CadelCadel Evans   
110                    Wiggins, BradleyBradley Wiggins   
111                          Froome, ChrisChris Froome   
112                    Nibali, VincenzoVincenzo Nibali   
113                         Froome, ChrisChris Froome*   
114                          Froome, ChrisChris Froome   
115                          Froome, ChrisChris Froome   

                                  3                        4  \
0                      Sponsor/Team                 Distance   
1                      La Française      2,428 km (1,509 mi)   
2                             Conte      2,428 km (1,509 mi)   
3                    Peugeot–Wolber      2,994 km (1,860 mi)   
4                    Peugeot–Wolber      4,637 km (2,881 mi)   
5                    Peugeot–Wolber      4,488 km (2,789 mi)   
6                    Peugeot–Wolber      4,497 km (2,794 mi)   
7                     Alcyon–Dunlop      4,498 km (2,795 mi)   
8                     Alcyon–Dunlop      4,734 km (2,942 mi)   
9                     Alcyon–Dunlop      5,343 km (3,320 mi)   
10                    Alcyon–Dunlop      5,289 km (3,286 mi)   
11                   Peugeot–Wolber      5,287 km (3,285 mi)   
12                   Peugeot–Wolber      5,380 km (3,340 mi)   
13                              NaN                      NaN   
14                              NaN                      NaN   
15                              NaN                      NaN   
16                              NaN                      NaN   
17                      La Sportive      5,560 km (3,450 mi)   
18                      La Sportive      5,503 km (3,419 mi)   
19                      La Sportive      5,485 km (3,408 mi)   
20                   Peugeot–Wolber      5,375 km (3,340 mi)   
21              Automoto–Hutchinson      5,386 km (3,347 mi)   
22                         Automoto      5,425 km (3,371 mi)   
23              Automoto–Hutchinson      5,440 km (3,380 mi)   
24              Automoto–Hutchinson      5,745 km (3,570 mi)   
25                    Alcyon–Dunlop      5,398 km (3,354 mi)   
26                    Alcyon–Dunlop      5,476 km (3,403 mi)   
27                    Alcyon–Dunlop      5,286 km (3,285 mi)   
28                    Alcyon–Dunlop      4,822 km (2,996 mi)   
29                           France      5,091 km (3,163 mi)   
..                              ...                      ...   
86                         Reynolds      3,286 km (2,042 mi)   
87      AD Renting–W-Cup–Bottecchia      3,285 km (2,041 mi)   
88                        Z–Tomasso      3,504 km (2,177 mi)   
89                          Banesto      3,914 km (2,432 mi)   
90                          Banesto      3,983 km (2,475 mi)   
91                          Banesto      3,714 km (2,308 mi)   
92                          Banesto      3,978 km (2,472 mi)   
93                          Banesto      3,635 km (2,259 mi)   
94                     Team Telekom      3,765 km (2,339 mi)   
95                     Team Telekom      3,950 km (2,450 mi)   
96            Mercatone Uno–Bianchi      3,875 km (2,408 mi)   
97              U.S. Postal Service      3,687 km (2,291 mi)   
98              U.S. Postal Service      3,662 km (2,275 mi)   
99              U.S. Postal Service      3,458 km (2,149 mi)   
100             U.S. Postal Service      3,272 km (2,033 mi)   
101             U.S. Postal Service      3,427 km (2,129 mi)   
102             U.S. Postal Service      3,391 km (2,107 mi)   
103               Discovery Channel      3,593 km (2,233 mi)   
104  Caisse d'Epargne–Illes Balears      3,657 km (2,272 mi)   
105               Discovery Channel      3,570 km (2,220 mi)   
106                        Team CSC      3,559 km (2,211 mi)   
107                          Astana      3,459 km (2,149 mi)   
108                  Team Saxo Bank      3,642 km (2,263 mi)   
109                 BMC Racing Team      3,430 km (2,130 mi)   
110                        Team Sky      3,496 km (2,172 mi)   
111                        Team Sky      3,404 km (2,115 mi)   
112                          Astana  3,660.5 km (2,274.5 mi)   
113                        Team Sky  3,360.3 km (2,088.0 mi)   
114                        Team Sky      3,529 km (2,193 mi)   
115                        Team Sky      3,540 km (2,200 mi)   

                     5                    6           7               8  
0          Time/Points               Margin  Stage wins  Stages in lead  
1     094 !94h 33' 14"  24921 !+ 2h 59' 21"           3               6  
2     096 !96h 05' 55"  21614 !+ 2h 16' 14"           1               3  
3                   35                   26           5              10  
4                   31                    8           5              12  
5                   47                   19           2               5  
6                   36                   32           5              13  
7                   37                   20           6              13  
8                   63                    4           4               3  
9                   43                   18           2              13  
10                  49                   59           3              13  
11   197 !197h 54' 00"      00837 !+ 8' 37"           1               8  
12   200 !200h 28' 48"      00150 !+ 1' 50"           1              15  
13                 NaN                  NaN         NaN             NaN  
14                 NaN                  NaN         NaN             NaN  
15                 NaN                  NaN         NaN             NaN  
16                 NaN                  NaN         NaN             NaN  
17   231 !231h 07' 15"  14254 !+ 1h 42' 54"           1               2  
18   228 !228h 36' 13"     05721 !+ 57' 21"           4              14  
19   221 !221h 50' 26"     01836 !+ 18' 36"           2              14  
20   222 !222h 08' 06"     04115 !+ 41' 15"           0               3  
21   222 !222h 15' 30"     03041 !+ 30 '41"           3               6  
22   226 !226h 18' 21"     03536 !+ 35' 36"           4              15  
23   219 !219h 10' 18"     05420 !+ 54' 20"           4              13  
24   238 !238h 44' 25"  12225 !+ 1h 22' 25"           2               8  
25   198 !198h 16' 42"  14841 !+ 1h 48' 41"           3              14  
26   192 !192h 48' 58"     05007 !+ 50' 07"           5              22  
27   186 !186h 39' 15"      04423 !+44' 23"           1              16  
28   172 !172h 12' 16"     01413 !+ 14' 13"           2              13  
29   177 !177h 10' 03"     01256 !+ 12' 56"           1              16  
..                 ...                  ...         ...             ...  
86    084 !84h 27' 53"      00713 !+ 7' 13"           1              11  
87    087 !87h 38' 35"          00008 !+ 8"           3               8  
88    090 !90h 43' 20"      00216 !+ 2' 16"           0               2  
89   101 !101h 01' 20"      00336 !+ 3' 36"           2              10  
90   100 !100h 49' 30"      00435 !+ 4' 35"           3              10  
91    095 !95h 57' 09"      00459 !+ 4' 59"           2              14  
92   103 !103h 38' 38"      00539 !+ 5' 39"           1              13  
93    092 !92h 44' 59"      00435 !+ 4' 35"           2              13  
94    095 !95h 57' 16"      00141 !+ 1' 41"           2              13  
95   100 !100h 30' 35"      00909 !+ 9' 09"           2              12  
96    092 !92h 49' 46"      00321 !+ 3' 21"           2               7  
97    091 !91h 32' 16"      00737 !+ 7' 37"           4              15  
98    092 !92h 33' 08"      00602 !+ 6' 02"           1              12  
99    086 !86h 17' 28"      00644 !+ 6' 44"           4               8  
100   082 !82h 05' 12"      00717 !+ 7' 17"           4              11  
101   083 !83h 41' 12"      00101 !+ 1' 01"           1              13  
102   083 !83h 36' 02"      00619 !+ 6' 19"           5               7  
103   086 !86h 15' 02"      00440 !+ 4' 40"           1              17  
104   089 !89h 40' 27"         00032 !+ 32"           0               8  
105   091 !91h 00' 26"         00023 !+ 23"           1               4  
106   087 !87h 52' 52"         00058 !+ 58"           1               5  
107   085 !85h 48' 35"      00411 !+ 4' 11"           2               7  
108   091 !91h 59' 27"      00122 !+ 1' 22"           2              12  
109   086 !86h 12' 22"      00134 !+ 1' 34"           1               2  
110   087 !87h 34' 47"      00321 !+ 3' 21"           2              14  
111   083 !83h 56' 20"      00420 !+ 4' 20"           3              14  
112   089 !89h 59' 06"      00737 !+ 7' 37"           4              19  
113   084 !84h 46' 14"      00112 !+ 1' 12"           1              16  
114   089 !89h 04' 48"      00405 !+ 4' 05"           2              14  
115   086 !86h 20' 55"         00054 !+ 54"           0              15  

[116 rows x 9 columns]]

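That is all well and good, but the problem is that it doesn't seem to be separated by row. For example, when I try to print only the first row, it prints the entire dataset again. Here is an attempt to print only the first row and second column (so it should be a single value):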

print(data[0][2])

0            Country
1             France
2             France
3             France
4             France
5             France
6             France
7         Luxembourg
8             France
9             France
10           Belgium
11           Belgium
12           Belgium
13       World War I
14               NaN
15               NaN
16               NaN
17           Belgium
18           Belgium
19           Belgium
20           Belgium
21            France
22             Italy
23             Italy
24           Belgium
25        Luxembourg
26        Luxembourg
27           Belgium
28            France
29            France
           ...      
86             Spain
87     United States
88     United States
89             Spain
90             Spain
91             Spain
92             Spain
93             Spain
94           Denmark
95           Germany
96             Italy
97     United States
98     United States
99     United States
100    United States
101    United States
102    United States
103    United States
104            Spain
105            Spain
106            Spain
107            Spain
108       Luxembourg
109        Australia
110    Great Britain
111    Great Britain
112            Italy
113    Great Britain
114    Great Britain
115    Great Britain
Name: 1, Length: 116, dtype: object

All I want is for this to behave as a DataFrame with 116 rows and 9 columns. Any idea how to fix this?

3 Answers:

Answer 0: (score: 4)

If we look at the pandas documentation for read_html, we can see that it actually returns a list of DataFrames rather than a single DataFrame. We can confirm this at runtime:

>> print(type(data))
<class 'list'>

The first element of that list is the actual DataFrame containing your values.

>> print(type(data[0]))
<class 'pandas.core.frame.DataFrame'>

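The simple fix is to reassign data to data[0]. After that you can access individual rows. Row indexing on a DataFrame does not behave like indexing a normal list, so I recommend looking into .iloc and .loc and reading up on how DataFrame indexing works.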

An example of this solution:

>> data = data[0]
>> print(data.iloc[1])
0                           1903
1                         France
2    Garin, MauriceMaurice Garin
3                   La Française
4            2,428 km (1,509 mi)
5               094 !94h 33' 14"
6            24921 !+ 2h 59' 21"
7                              3
8                              6
Name: 1, dtype: object
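To make the difference between list indexing, column selection, and row selection concrete, here is a minimal sketch. It assumes a tiny hand-built stand-in for the scraped table (three illustrative rows, not the full Wikipedia data), so it runs without a network connection or an HTML parser.

```python
import pandas as pd

# Hypothetical stand-in for what read_html returned in the question:
# a list holding one DataFrame whose columns are labelled 0, 1, 2 and
# whose first row contains the header text.
data = [pd.DataFrame([
    ["Year", "Country", "Cyclist"],
    ["1903", "France", "Maurice Garin"],
    ["1904", "France", "Henri Cornet"],
])]

df = data[0]              # index the list first to get the DataFrame

column = df[1]            # plain [] with a label selects a whole column (a Series)
row = df.iloc[1]          # .iloc selects by position: the second row
cell = df.iloc[1, 1]      # row position 1, column position 1: a single value
same_cell = df.loc[1, 1]  # .loc selects by label; labels equal positions here

print(type(data).__name__)
print(column.tolist())
print(row.tolist())
print(cell, same_cell)
```

This is why data[0][2] in the question printed an entire column: data[0] picks the DataFrame out of the list, and the following [2] selects the column labelled 2 rather than a row.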

Answer 1: (score: 2)

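The pandas function read_html returns a list of DataFrames. So in your case, I think you need to select the first element of the returned list, as shown on line 8 of the code below.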

Also note that you have a typo in the import line for BeautifulSoup; please update the code in your question accordingly.

I hope my output is what you are looking for.

Code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get("Https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
parser = BeautifulSoup(response.content, 'html.parser')
winners_table = parser.find_all('table')[1]
data = pd.read_html(str(winners_table), flavor = 'lxml')[0]
print("type of variable data: " + str(type(data)))
print(data[0][2])

Output:

type of variable data: <class 'pandas.core.frame.DataFrame'>

1904

Note that I used lxml instead of html5lib.
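In the printed output from the question, the header text ("Year", "Country", ...) sits in row 0, and some years carry footnote markers such as "1999[B]". A possible follow-up, once data is a single DataFrame, is to promote that first row to column names and clean the values. The sketch below is an assumption about what you might want, not part of the original fix. It uses a small hand-built frame named raw (a hypothetical name) in place of the scraped table.

```python
import pandas as pd

# Hypothetical miniature of the scraped frame: integer column labels,
# header text in the first row, a footnote marker on one year.
raw = pd.DataFrame([
    ["Year", "Country", "Stage wins"],
    ["1998", "Italy", "2"],
    ["1999[B]", "United States", "4"],
])

# Drop the header row from the data and use it as the column names.
table = raw.iloc[1:].reset_index(drop=True)
table.columns = raw.iloc[0].tolist()

# Strip footnote markers like "[B]" and convert numeric columns to integers.
table["Year"] = table["Year"].str.replace(r"\[.*?\]", "", regex=True).astype(int)
table["Stage wins"] = table["Stage wins"].astype(int)

print(table)
print(table.loc[1, "Country"])
```

With named columns, a lookup such as table.loc[1, "Country"] reads more clearly than chained integer indexing.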

Answer 2: (score: 1)

You can try this:

df = data[0]
# iterate through the data frame using iterrows()

for index, row in df.iterrows():
    print ("Col1:", row[0], " Col2: ", row[1], "Col3:", row[2], "Col4:", row[3]) #etc for all cols

I hope this helps!
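As a variation on the row loop above, itertuples() is another way to walk a DataFrame row by row and is often faster than iterrows(). This is a minimal sketch on a hand-built two-row frame (illustrative values only). The lines list is a hypothetical helper used to collect the formatted output.

```python
import pandas as pd

# Small illustrative frame with integer column labels, like the scraped one.
df = pd.DataFrame([
    ["1903", "France", "Maurice Garin"],
    ["1904", "France", "Henri Cornet"],
])

lines = []
for row in df.itertuples(index=False):
    # With index=False each tuple holds only the column values,
    # so row[0] is the first column, row[1] the second, and so on.
    lines.append(f"Year: {row[0]}, Country: {row[1]}, Cyclist: {row[2]}")

for line in lines:
    print(line)
```

Positional access is used here because integer column labels are not valid attribute names on the generated tuples.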