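How to scrape a table from a website in Python

Asked: 2019-12-13 03:18:47

Tags: python html web-scraping beautifulsoup

I'm comfortable with Python, but I have never tried scraping data from a website before. I looked through the BeautifulSoup documentation, but I don't know enough HTML to get the information I need. I'm trying to retrieve the data from the table on this page: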

https://en.wikipedia.org/wiki/List_of_highest-grossing_films

If I want to list the rank, title, and year of each film, how would I do that?

Here is what I have so far. It isn't much, but it's a start.

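    import requests
    from bs4 import BeautifulSoup

    url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'
    resp = requests.get(url)

    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, 'html.parser')
        # Collect every <a> (link) element on the page
        l = soup.find_all('a')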

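I used find_all('a') because every title is a clickable link. But the page contains many other links, so this is probably not the best choice. The other information I want is not clickable at all, and I'm not sure how to proceed.

2 Answers:

Answer 0 (score: 2)

Here you go. You can use this on ANY Wikipedia page that contains a table. Because a Wikipedia page can contain several tables, use the whichtable argument to choose which one to extract. The program writes the table to a CSV file that can then be loaded into a DataFrame: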

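    import requests
    import csv
    from bs4 import BeautifulSoup
    import pandas as pd

    def getContent(link, filename, whichtable=0):
        # Download the page and parse it (the 'lxml' parser must be installed)
        result1 = requests.get(link)
        src1 = result1.content
        soup = BeautifulSoup(src1, 'lxml')
        # Pick the requested table; 0 is the first table on the page
        table = soup.find_all('table')[whichtable]
        # Write UTF-8 explicitly so pd.read_csv can read the file back on any platform
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            # Calling a tag, e.g. table('tr'), is shorthand for table.find_all('tr')
            for tr in table('tr'):
                row = [t.get_text(strip=True) for t in tr(['td', 'th'])]
                writer.writerow(row)

    getContent('https://en.wikipedia.org/wiki/List_of_highest-grossing_films', 'what.csv', whichtable=0)

    df = pd.read_csv('what.csv')

    df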

Output:

    Rank   Peak                                          Title   Worldwide gross  Year  Reference(s)
0      1      1                              Avengers: Endgame    $2,797,800,564  2019    [# 1][# 2]
1      2      1                                         Avatar    $2,789,679,794  2009    [# 3][# 4]
2      3      1                                        Titanic    $2,187,463,944  1997    [# 5][# 6]
3      4      3                   Star Wars: The Force Awakens    $2,068,223,624  2015    [# 7][# 8]
4      5      4                         Avengers: Infinity War    $2,048,359,754  2018   [# 9][# 10]
5      6      3                                 Jurassic World    $1,671,713,208  2015  [# 11][# 12]
6      7      7                                  The Lion King    $1,656,405,082  2019   [# 13][# 2]
7      8      3                                   The Avengers    $1,518,812,988  2012  [# 14][# 15]
8      9      4                                      Furious 7    $1,516,045,911  2015  [# 16][# 17]
9     10      5                        Avengers: Age of Ultron    $1,405,403,694  2015  [# 18][# 17]
10    11      9                                  Black Panther    $1,346,913,161  2018  [# 19][# 20]
11    12      3  Harry Potter and the Deathly Hallows – Part 2    $1,341,693,157  2011  [# 21][# 22]
12    13      9                       Star Wars: The Last Jedi    $1,332,539,889  2017  [# 23][# 24]
13    14     12                 Jurassic World: Fallen Kingdom    $1,309,484,461  2018  [# 25][# 10]
14    15      5                                         Frozen   F$1,290,000,000  2013  [# 26][# 27]
15    16     10                           Beauty and the Beast    $1,263,521,126  2017  [# 28][# 29]
16    17     15                                  Incredibles 2    $1,242,805,359  2018  [# 30][# 10]
17    18     11                        The Fate of the Furious  F8$1,238,764,765  2017  [# 31][# 29]
18    19      5                                     Iron Man 3    $1,214,811,252  2013  [# 32][# 33]
19    20     10                                        Minions    $1,159,398,397  2015  [# 34][# 12]
20    21     12                     Captain America: Civil War    $1,153,304,495  2016  [# 35][# 36]
21    22     20                                        Aquaman    $1,148,161,807  2018  [# 37][# 10]
22    23     23                      Spider-Man: Far From Home    $1,131,927,996  2019   [# 38][# 2]
23    24     22                                 Captain Marvel    $1,128,274,794  2019  [# 39][# 40]
24    25      4                 Transformers: Dark of the Moon    $1,123,794,079  2011  [# 41][# 22]
25    26      2  The Lord of the Rings: The Return of the King    $1,120,237,002  2003  [# 42][# 43]
26    27      7                                        Skyfall    $1,108,561,013  2012  [# 44][# 45]
27    28     10                Transformers: Age of Extinction    $1,104,054,072  2014  [# 46][# 47]
28    29      7                          The Dark Knight Rises    $1,084,939,099  2012  [# 48][# 49]
29    30     30                                    Toy Story 4    $1,073,394,593  2019   [# 50][# 2]
30    31   4TS3                                    Toy Story 3    $1,066,969,703  2010  [# 51][# 52]
31    32      3     Pirates of the Caribbean: Dead Man's Chest    $1,066,179,725  2006  [# 53][# 54]
32    33     33                                          Joker    $1,057,193,906  2019  [# 55][# 56]
33    34     20                   Rogue One: A Star Wars Story    $1,056,057,273  2016    [14][# 57]
34    35     34                                        Aladdin    $1,050,693,953  2019   [# 58][# 2]
35    36      6    Pirates of the Caribbean: On Stranger Tides    $1,045,713,802  2011  [# 59][# 52]
36    37     24                                Despicable Me 3    $1,034,799,409  2017  [# 60][# 29]
37    38      1                                  Jurassic Park    $1,029,939,903  1993  [# 61][# 62]
38    39     22                                   Finding Dory    $1,028,570,889  2016  [# 63][# 64]
39    40      2      Star Wars: Episode I – The Phantom Menace    $1,027,044,677  1999   [# 65][# 6]
40    41      5                            Alice in Wonderland    $1,025,467,110  2010  [# 66][# 67]
41    42     24                                       Zootopia    $1,023,784,195  2016  [# 68][# 36]
42    43     14              The Hobbit: An Unexpected Journey    $1,021,103,568  2012  [# 69][# 70]
43    44      4                                The Dark Knight    $1,004,934,033  2008  [# 71][# 72]
44    45    2PS       Harry Potter and the Philosopher's Stone      $975,051,288  2001  [# 73][# 74]
45    46  19DM2                                Despicable Me 2      $970,761,885  2013  [# 75][# 33]
46    47      2                                  The Lion King      $968,483,777  1994  [# 76][# 62]
47    48     30                                The Jungle Book      $966,550,600  2016  [# 77][# 78]
48    49      5       Pirates of the Caribbean: At World's End      $963,420,425  2007  [# 79][# 80]
49    50     40                 Jumanji: Welcome to the Jungle      $962,126,927  2017  [# 81][# 20]
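
A few cells in the output above carry stray characters, for example `F$1,290,000,000` and `4TS3`. These look like superscript note markers that `get_text()` merged into the cell text; that is an assumption about the page markup, not something the answer states. A minimal sketch of one way to drop such markers before reading a cell, using a made-up HTML fragment (`SAMPLE_ROW`) and a hypothetical helper (`clean_cells`):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one table row; NOT the real page markup.
SAMPLE_ROW = """
<table><tr>
  <td>15</td>
  <td>5</td>
  <th><i><a href="#">Frozen</a></i></th>
  <td><sup><a href="#">F</a></sup>$1,290,000,000</td>
  <td>2013</td>
</tr></table>
"""

def clean_cells(tr):
    """Return the text of each cell in a row, ignoring <sup> markers."""
    cells = []
    for cell in tr.find_all(['td', 'th']):
        # Remove superscript elements so their text is not merged into the cell.
        for sup in cell.find_all('sup'):
            sup.decompose()
        cells.append(cell.get_text(strip=True))
    return cells

soup = BeautifulSoup(SAMPLE_ROW, 'html.parser')
row = clean_cells(soup.find('tr'))
print(row)
```

If the citations in the Reference(s) column are also superscripts, this approach would empty that column as well, which may or may not be what you want.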

Answer 1 (score: 0)

You could try something like this:

    # Parse the downloaded HTML (resp comes from requests.get(url))
    soup = BeautifulSoup(resp.content, 'html.parser')
    # Grab the first table
    table = soup.find_all('table')[0]

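At this point, you can loop over the table object to build either one list per column or a single list containing every table row, and then create a pandas DataFrame from it.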
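
The loop described above could look roughly like the sketch below. It builds a single list of rows, turns it into a DataFrame, and keeps only the rank, title, and year columns the question asks about. So that it runs offline, it parses a small inline string (`SAMPLE_HTML`, a hypothetical stand-in for `resp.content`) and assumes the first row of the table holds the column headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for resp.content; the real page markup is more complex.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Peak</th><th>Title</th><th>Worldwide gross</th><th>Year</th></tr>
  <tr><td>1</td><td>1</td><th>Avengers: Endgame</th><td>$2,797,800,564</td><td>2019</td></tr>
  <tr><td>2</td><td>1</td><th>Avatar</th><td>$2,789,679,794</td><td>2009</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find_all('table')[0]

# One inner list per table row, with header and data cells in document order.
rows = []
for tr in table.find_all('tr'):
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])])

# Assumption: the first row contains the column names.
header, body = rows[0], rows[1:]
df = pd.DataFrame(body, columns=header)

# Keep only the columns the question asks for.
films = df[['Rank', 'Title', 'Year']]
print(films)
```

All values come out as strings here; converting `Rank` and `Year` to integers (for example with `astype(int)`) is a separate step if numeric columns are needed.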