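I am trying to pull all of the tables together. I can get the first set of data, which I think means the scraping side works, but I run into a problem when I try to combine all of the data.
I tried declaring the DataFrame early and then having each loop iteration fill it with the table data.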
names = {'Iron-Man',
'Incredible-Hulk-The',
'Thor',
'Iron-Man-2',
'Captain-America-The-First-Avenger',
'Avengers-The-(2012)',
'Iron-Man-3',
'Thor-The-Dark-World',
'Captain-America-The-Winter-Soldier',
'Guardians-of-the-Galaxy',
'Avengers-Age-of-Ultron',
'Ant-Man',
'Captain-America-Civil-War',
'Doctor-Strange-(2016)',
'Guardians-of-the-Galaxy-Vol-2',
'Spider-Man-Homecoming',
'Thor-Ragnarok',
'Black-Panther',
'Avengers-Infinity-War',
'Ant-Man-and-the-Wasp',
'Captain-Marvel-(2019)',
'Avengers-Endgame-(2019)'
}
This code works for getting the table from a single page:
data = requests.get('https://www.the-numbers.com/movie/Iron-Man#tab=box-office')
soup = BeautifulSoup(data.text, 'html.parser')
data = []
div = soup.find('div' , {'id': 'box_office_chart'})
table = div.find('table')
tbody = table.find('tbody')
html = table.encode().decode('utf8')
dfs = pd.read_html(html,header=0)
df = dfs[0]
df
I was hoping this code would loop through all of the names and grab each table.
for name in names:
    print(name)
    data = requests.get('https://www.the-numbers.com/movie/' + name + '#tab=box-office')
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2
    df.append(df2)
    print(name)
df
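Every movie name is printed twice, so I know the loop at least reaches every page. This is the output, which does not contain any of the other movies.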
df Output:
Date Rank Gross % Change Theaters Per Theaters Total Gross Week movie
0 May 2, 2008 1 $102,118,668 NaN 4105 $24,877 $102,118,668 1 Iron-Man
1 May 9, 2008 1 $51,190,629 -50% 4111 $12,452 $177,825,024 2 Iron-Man
2 May 16, 2008 2 $31,838,996 -38% 4154 $7,665 $223,124,385 3 Iron-Man
3 May 23, 2008 3 $20,447,253 -36% 3915 $5,223 $252,614,669 4 Iron-Man
4 May 30, 2008 4 $13,541,264 -34% 3650 $3,710 $276,166,336 5 Iron-Man
5 Jun 6, 2008 6 $7,477,439 -45% 2931 $2,551 $288,847,640 6 Iron-Man
6 Jun 13, 2008 7 $5,620,375 -25% 2403 $2,339 $297,918,329 7 Iron-Man
7 Jun 20, 2008 9 $4,030,272 -28% 1912 $2,108 $304,816,141 8 Iron-Man
8 Jun 27, 2008 11 $2,257,113 -44% 1379 $1,637 $309,179,318 9 Iron-Man
9 Jul 4, 2008 12 $1,459,613 -35% 1019 $1,432 $311,708,133 10 Iron-Man
10 Jul 11, 2008 14 $939,134 -36% 710 $1,323 $313,421,025 11 Iron-Man
11 Jul 18, 2008 16 $451,838 -52% 375 $1,205 $314,376,968 12 Iron-Man
12 Jul 25, 2008 22 $310,654 -31% 274 $1,134 $314,925,955 13 Iron-Man
13 Aug 1, 2008 16 $580,179 +87% 407 $1,426 $315,687,768 14 Iron-Man
14 Aug 8, 2008 19 $426,502 -26% 45 $1,236 $316,468,817 15 Iron-Man
15 Aug 15, 2008 23 $341,178 -20% 315 $1,083 $317,058,295 16 Iron-Man
16 Aug 22, 2008 29 $243,342 -29% 257 $947 $317,473,452 17 Iron-Man
17 Aug 29, 2008 33 $223,636 -8% 220 $1,017 $317,794,156 18 Iron-Man
18 Sep 5, 2008 38 $126,734 -43% 205 $618 $318,006,770 19 Iron-Man
19 Sep 12, 2008 39 $94,816 -25% 156 $608 $318,134,740 20 Iron-Man
20 Sep 19, 2008 43 $59,037 -38% 124 $476 $318,219,154 21 Iron-Man
21 Sep 26, 2008 48 $58,364 -1% 121 $482 $318,298,180 22 Iron-Man
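I was expecting all of the tables from the other pages to be appended to df. I am not sure where I am going wrong.
Edit: I dropped my first attempt at getting the data and just used a bunch of elif statements to create all 22 DataFrames. Thanks, everyone, for the suggestions.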
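Answer 0 (score: 0)
There is no need for all of the elif statements. DataFrame.append returns a new DataFrame rather than modifying df in place, so to append the current loop's df2 to the final result df you need to assign the result back: df = df.append(df2).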
import requests
import pandas as pd
from bs4 import BeautifulSoup
names = {'Iron-Man',
'Incredible-Hulk-The',
'Thor',
'Iron-Man-2',
'Captain-America-The-First-Avenger',
'Avengers-The-(2012)',
'Iron-Man-3',
'Thor-The-Dark-World',
'Captain-America-The-Winter-Soldier',
'Guardians-of-the-Galaxy',
'Avengers-Age-of-Ultron',
'Ant-Man',
'Captain-America-Civil-War',
'Doctor-Strange-(2016)',
'Guardians-of-the-Galaxy-Vol-2',
'Spider-Man-Homecoming',
'Thor-Ragnarok',
'Black-Panther',
'Avengers-Infinity-War',
'Ant-Man-and-the-Wasp',
'Captain-Marvel-(2019)',
'Avengers-Endgame-(2019)'
}
df = pd.DataFrame()
for name in names:
    print(name)
    url = 'https://www.the-numbers.com/movie/' + name + '#tab=box-office'
    data = requests.get(url)
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2['movie'] = name
    df = df.append(df2)
    print(name)
df = df.reset_index(drop=True)
Output:
print (df)
Date Rank ... Week movie
0 Mar 8, 2019 1 ... 1 Captain-Marvel-(2019)
1 Mar 15, 2019 1 ... 2 Captain-Marvel-(2019)
2 Mar 22, 2019 2 ... 3 Captain-Marvel-(2019)
3 Mar 29, 2019 3 ... 4 Captain-Marvel-(2019)
4 Apr 5, 2019 5 ... 5 Captain-Marvel-(2019)
5 Apr 12, 2019 6 ... 6 Captain-Marvel-(2019)
6 Apr 19, 2019 4 ... 7 Captain-Marvel-(2019)
7 Apr 26, 2019 2 ... 8 Captain-Marvel-(2019)
8 Apr 27, 2018 1 ... 1 Avengers-Infinity-War
9 May 4, 2018 1 ... 2 Avengers-Infinity-War
10 May 11, 2018 1 ... 3 Avengers-Infinity-War
11 May 18, 2018 2 ... 4 Avengers-Infinity-War
12 May 25, 2018 3 ... 5 Avengers-Infinity-War
13 Jun 1, 2018 4 ... 6 Avengers-Infinity-War
14 Jun 8, 2018 5 ... 7 Avengers-Infinity-War
15 Jun 15, 2018 8 ... 8 Avengers-Infinity-War
16 Jun 22, 2018 9 ... 9 Avengers-Infinity-War
17 Jun 29, 2018 12 ... 10 Avengers-Infinity-War
18 Jul 6, 2018 15 ... 11 Avengers-Infinity-War
19 Jul 13, 2018 16 ... 12 Avengers-Infinity-War
20 Jul 20, 2018 20 ... 13 Avengers-Infinity-War
21 Jul 27, 2018 21 ... 14 Avengers-Infinity-War
22 Aug 3, 2018 24 ... 15 Avengers-Infinity-War
23 Aug 10, 2018 26 ... 16 Avengers-Infinity-War
24 Aug 17, 2018 37 ... 17 Avengers-Infinity-War
25 Aug 24, 2018 42 ... 18 Avengers-Infinity-War
26 Aug 31, 2018 44 ... 19 Avengers-Infinity-War
27 Sep 7, 2018 52 ... 20 Avengers-Infinity-War
28 Apr 26, 2019 1 ... 1 Avengers-Endgame-(2019)
29 May 5, 2017 1 ... 1 Guardians-of-the-Galaxy-Vol-2
.. ... ... ... ... ...
367 Aug 1, 2008 16 ... 14 Iron-Man
368 Aug 8, 2008 19 ... 15 Iron-Man
369 Aug 15, 2008 23 ... 16 Iron-Man
370 Aug 22, 2008 29 ... 17 Iron-Man
371 Aug 29, 2008 33 ... 18 Iron-Man
372 Sep 5, 2008 38 ... 19 Iron-Man
373 Sep 12, 2008 39 ... 20 Iron-Man
374 Sep 19, 2008 43 ... 21 Iron-Man
375 Sep 26, 2008 48 ... 22 Iron-Man
376 Jul 7, 2017 1 ... 1 Spider-Man-Homecoming
377 Jul 14, 2017 2 ... 2 Spider-Man-Homecoming
378 Jul 21, 2017 3 ... 3 Spider-Man-Homecoming
379 Jul 28, 2017 5 ... 4 Spider-Man-Homecoming
380 Aug 4, 2017 6 ... 5 Spider-Man-Homecoming
381 Aug 11, 2017 7 ... 6 Spider-Man-Homecoming
382 Aug 18, 2017 7 ... 7 Spider-Man-Homecoming
383 Aug 25, 2017 7 ... 8 Spider-Man-Homecoming
384 Sep 1, 2017 7 ... 9 Spider-Man-Homecoming
385 Sep 8, 2017 7 ... 10 Spider-Man-Homecoming
386 Sep 15, 2017 9 ... 11 Spider-Man-Homecoming
387 Sep 22, 2017 11 ... 12 Spider-Man-Homecoming
388 Sep 29, 2017 18 ... 13 Spider-Man-Homecoming
389 Oct 6, 2017 20 ... 14 Spider-Man-Homecoming
390 Oct 13, 2017 20 ... 15 Spider-Man-Homecoming
391 Oct 20, 2017 27 ... 16 Spider-Man-Homecoming
392 Oct 27, 2017 33 ... 17 Spider-Man-Homecoming
393 Nov 3, 2017 37 ... 18 Spider-Man-Homecoming
394 Nov 10, 2017 42 ... 19 Spider-Man-Homecoming
395 Nov 17, 2017 46 ... 20 Spider-Man-Homecoming
396 Nov 24, 2017 51 ... 21 Spider-Man-Homecoming
[397 rows x 9 columns]
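A side note for newer pandas versions: DataFrame.append was removed in pandas 2.0, so the reassignment pattern above only runs on older releases. A common alternative is to collect each table in a plain Python list and call pd.concat once at the end. The sketch below shows only that combining step. The helper name combine_tables and the two small sample tables are illustrative stand-ins for the scraped data, not part of the original code.

```python
import pandas as pd


def combine_tables(tables_by_movie):
    """Stack per-movie tables into one DataFrame, tagging rows with the movie name.

    `tables_by_movie` is a hypothetical dict mapping a movie name to the
    DataFrame scraped for it (the `df2` of the loop above).
    """
    frames = []
    for name, table in tables_by_movie.items():
        tagged = table.copy()      # leave the caller's table untouched
        tagged['movie'] = name     # same labelling as df2['movie'] = name
        frames.append(tagged)      # list.append does modify the list in place
    # One concat at the end; ignore_index=True renumbers rows 0..n-1,
    # which replaces the separate reset_index(drop=True) call.
    return pd.concat(frames, ignore_index=True)


# Placeholder tables standing in for the scraped box-office data.
sample = {
    'Iron-Man': pd.DataFrame({'Week': [1, 2], 'Rank': [1, 1]}),
    'Thor': pd.DataFrame({'Week': [1], 'Rank': [1]}),
}
combined = combine_tables(sample)
print(combined)
```

In the scraping loop, this would amount to replacing df = df.append(df2) with frames.append(df2) and building df after the loop. Note that pd.concat raises a ValueError on an empty list, so the list needs at least one table.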