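I'm comfortable with Python, but I've never tried scraping data from a website before. I've looked through the BeautifulSoup documentation, but I don't know enough HTML to get the information I need. I'm trying to retrieve the data from the table on this page:
https://en.wikipedia.org/wiki/List_of_highest-grossing_films
If I want to list each film's rank, title, and year, how would I go about it?
Here is what I have so far. It's not much, but it's a start.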
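import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'
resp = requests.get(url)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, 'html.parser')
    # Collect every link (<a> tag) on the page
    l = soup.find_all('a')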
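I used find_all('a') because all of the titles are clickable links. But there are many other links on the page, so this is probably not the best choice. The other information I want (rank and year) is not clickable at all, and I don't know how to get at it.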
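Answer 0 (score: 2)
Here you go. You can use this on ANY Wikipedia page. Use the whichtable= option to choose the table, since a Wikipedia page can contain more than one. The program writes a clean CSV file that you can load into a DataFrame: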
import csv
import pandas as pd
import requests
from bs4 import BeautifulSoup

def getContent(link, filename, whichtable=0):
    # Download the page and parse it
    result1 = requests.get(link)
    src1 = result1.content
    soup = BeautifulSoup(src1, 'lxml')
    # Pick the table by its position on the page (0 = first table)
    table = soup.find_all('table')[whichtable]
    # utf-8 keeps non-ASCII characters in titles intact on every platform
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Calling a tag, as in table('tr'), is shorthand for find_all()
        for tr in table('tr'):
            row = [t.get_text(strip=True) for t in tr(['td', 'th'])]
            writer.writerow(row)

getContent('https://en.wikipedia.org/wiki/List_of_highest-grossing_films', 'what.csv', whichtable=0)
df = pd.read_csv('what.csv')
df
Output
Rank Peak Title Worldwide gross Year Reference(s)
0 1 1 Avengers: Endgame $2,797,800,564 2019 [# 1][# 2]
1 2 1 Avatar $2,789,679,794 2009 [# 3][# 4]
2 3 1 Titanic $2,187,463,944 1997 [# 5][# 6]
3 4 3 Star Wars: The Force Awakens $2,068,223,624 2015 [# 7][# 8]
4 5 4 Avengers: Infinity War $2,048,359,754 2018 [# 9][# 10]
5 6 3 Jurassic World $1,671,713,208 2015 [# 11][# 12]
6 7 7 The Lion King $1,656,405,082 2019 [# 13][# 2]
7 8 3 The Avengers $1,518,812,988 2012 [# 14][# 15]
8 9 4 Furious 7 $1,516,045,911 2015 [# 16][# 17]
9 10 5 Avengers: Age of Ultron $1,405,403,694 2015 [# 18][# 17]
10 11 9 Black Panther $1,346,913,161 2018 [# 19][# 20]
11 12 3 Harry Potter and the Deathly Hallows – Part 2 $1,341,693,157 2011 [# 21][# 22]
12 13 9 Star Wars: The Last Jedi $1,332,539,889 2017 [# 23][# 24]
13 14 12 Jurassic World: Fallen Kingdom $1,309,484,461 2018 [# 25][# 10]
14 15 5 Frozen F$1,290,000,000 2013 [# 26][# 27]
15 16 10 Beauty and the Beast $1,263,521,126 2017 [# 28][# 29]
16 17 15 Incredibles 2 $1,242,805,359 2018 [# 30][# 10]
17 18 11 The Fate of the Furious F8$1,238,764,765 2017 [# 31][# 29]
18 19 5 Iron Man 3 $1,214,811,252 2013 [# 32][# 33]
19 20 10 Minions $1,159,398,397 2015 [# 34][# 12]
20 21 12 Captain America: Civil War $1,153,304,495 2016 [# 35][# 36]
21 22 20 Aquaman $1,148,161,807 2018 [# 37][# 10]
22 23 23 Spider-Man: Far From Home $1,131,927,996 2019 [# 38][# 2]
23 24 22 Captain Marvel $1,128,274,794 2019 [# 39][# 40]
24 25 4 Transformers: Dark of the Moon $1,123,794,079 2011 [# 41][# 22]
25 26 2 The Lord of the Rings: The Return of the King $1,120,237,002 2003 [# 42][# 43]
26 27 7 Skyfall $1,108,561,013 2012 [# 44][# 45]
27 28 10 Transformers: Age of Extinction $1,104,054,072 2014 [# 46][# 47]
28 29 7 The Dark Knight Rises $1,084,939,099 2012 [# 48][# 49]
29 30 30 Toy Story 4 $1,073,394,593 2019 [# 50][# 2]
30 31 4TS3 Toy Story 3 $1,066,969,703 2010 [# 51][# 52]
31 32 3 Pirates of the Caribbean: Dead Man's Chest $1,066,179,725 2006 [# 53][# 54]
32 33 33 Joker $1,057,193,906 2019 [# 55][# 56]
33 34 20 Rogue One: A Star Wars Story $1,056,057,273 2016 [14][# 57]
34 35 34 Aladdin $1,050,693,953 2019 [# 58][# 2]
35 36 6 Pirates of the Caribbean: On Stranger Tides $1,045,713,802 2011 [# 59][# 52]
36 37 24 Despicable Me 3 $1,034,799,409 2017 [# 60][# 29]
37 38 1 Jurassic Park $1,029,939,903 1993 [# 61][# 62]
38 39 22 Finding Dory $1,028,570,889 2016 [# 63][# 64]
39 40 2 Star Wars: Episode I – The Phantom Menace $1,027,044,677 1999 [# 65][# 6]
40 41 5 Alice in Wonderland $1,025,467,110 2010 [# 66][# 67]
41 42 24 Zootopia $1,023,784,195 2016 [# 68][# 36]
42 43 14 The Hobbit: An Unexpected Journey $1,021,103,568 2012 [# 69][# 70]
43 44 4 The Dark Knight $1,004,934,033 2008 [# 71][# 72]
44 45 2PS Harry Potter and the Philosopher's Stone $975,051,288 2001 [# 73][# 74]
45 46 19DM2 Despicable Me 2 $970,761,885 2013 [# 75][# 33]
46 47 2 The Lion King $968,483,777 1994 [# 76][# 62]
47 48 30 The Jungle Book $966,550,600 2016 [# 77][# 78]
48 49 5 Pirates of the Caribbean: At World's End $963,420,425 2007 [# 79][# 80]
49 50 40 Jumanji: Welcome to the Jungle $962,126,927 2017 [# 81][# 20]
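The question only asks for rank, title, and year, so the CSV can be narrowed to those three columns once it is loaded. The following is a minimal sketch, assuming the column names match the header row shown in the output (Rank, Title, Year). The inline sample_csv is a hypothetical stand-in for what.csv, with values copied from the first rows of the output, so the snippet runs without a network connection.

```python
import io

import pandas as pd

# Hypothetical stand-in for the first rows of what.csv
# (values copied from the output shown above)
sample_csv = io.StringIO(
    "Rank,Peak,Title,Worldwide gross,Year,Reference(s)\n"
    '1,1,Avengers: Endgame,"$2,797,800,564",2019,[# 1][# 2]\n'
    '2,1,Avatar,"$2,789,679,794",2009,[# 3][# 4]\n'
    '3,1,Titanic,"$2,187,463,944",1997,[# 5][# 6]\n'
)

# With the real file this would be: pd.read_csv('what.csv')
df = pd.read_csv(sample_csv)

# Keep only the three columns the question asks for
films = df[['Rank', 'Title', 'Year']]

# Turn each row into a plain (rank, title, year) tuple
film_list = list(films.itertuples(index=False, name=None))
print(film_list)
```

Selecting by column name rather than by position means the code still works if the page gains or loses other columns, as long as these three headers stay the same.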
Answer 1 (score: 0)
You could try something like this:
# Parse the HTML as a string
soup = BeautifulSoup(resp.content, 'html.parser')
# Grab the first table
table = soup.find_all('table')[0]
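At this point, you can loop over the table object to build either one list per column or a single list containing every table row, and then create a Pandas DataFrame from it.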
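The loop over the table object could look roughly like the sketch below, which takes the "single list containing every table row" route. The html string is a hypothetical miniature of the Wikipedia table so the example runs offline; marking the title cells as <th> is an assumption about the page's markup, which is why both td and th cells are collected.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, so the sketch runs offline.
# Title cells are written as <th> here; that is an assumption about the real markup.
html = """
<table>
  <tr><th>Rank</th><th>Peak</th><th>Title</th><th>Worldwide gross</th><th>Year</th></tr>
  <tr><td>1</td><td>1</td><th><i><a href="/wiki/Avengers:_Endgame">Avengers: Endgame</a></i></th><td>$2,797,800,564</td><td>2019</td></tr>
  <tr><td>2</td><td>1</td><th><i><a href="/wiki/Avatar">Avatar</a></i></th><td>$2,789,679,794</td><td>2009</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
# Grab the first table, as in the answer above
table = soup.find_all('table')[0]

rows = []
for tr in table.find_all('tr'):
    # Collect both cell types so header-style cells are not skipped
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
    if cells:
        rows.append(cells)

# The first row holds the column names, the rest are data rows
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df[['Rank', 'Title', 'Year']])
```

Every value comes out as a string here; converting Rank and Year with pd.to_numeric is a reasonable next step if numeric sorting or filtering is needed.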