Game of Thrones Wikipedia Python scraper

Date: 2017-10-01 17:16:19

Tags: python web-scraping beautifulsoup wikipedia

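Hey, I am working on my second school project, a Python scraper built with BeautifulSoup. The assignment is to build an application that scrapes data from Wikipedia and reports the total viewership across all Game of Thrones seasons. As a bonus, the application should also list the views for each episode, give a total for each season, and finally give the grand total over all seasons.

Something like this: S01E1: 2.22 million, S01E2: 2.20 million, ..., Season 1 total views: xy

Grand total: 398.7 million

So far I have only managed to get the grand total.

If anyone has done something similar, please help :) Many thanks. Here is my code: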

import re
import urllib

from BeautifulSoup import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html)

seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
seasons = seasons_table.findAll('a', attrs={'href': re.compile('\/wiki\/Game_of_Thrones_\(season_?[0-9]+\)')})

views = 0

for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html)

    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})

    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})

        if episode_rows:
            for episode_row in episode_rows:
                episode_views = episode_row.findAll('td')[-1]

                # strip citation markers such as "[12]" from the cell text, then convert the rest to a float
                views += float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))

print 'The total number of views is ' + str(views) + ' millions'

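2 answers:

Answer 0 (score: 0)

The parsing does not need any further work. All that is left is printing the results on screen in the format you want, which is mostly string manipulation.

Code: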

import re
import urllib
from bs4 import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html, 'html.parser')
seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
seasons = seasons_table.findAll('a', attrs={'href': re.compile('\/wiki\/Game_of_Thrones_\(season_?[0-9]+\)')})

views = 0
total = 0
season_num = 1
for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html,'html.parser')
    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})
    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})
        if episode_rows:
            episode_num = 1
            for episode_row in episode_rows:
                episode_views = episode_row.findAll('td')[-1]
                # strip citation markers such as "[12]" from the cell text, then convert the rest to a float
                views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
                total += views  # reuse the parsed value instead of parsing the cell twice
                print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                episode_num += 1
    season_num += 1

print 'The total number of views is ' + str(total) + ' millions'

Output:

S1E1 : 2.22 Millions
S1E2 : 2.2 Millions
S1E3 : 2.44 Millions
S1E4 : 2.45 Millions
S1E5 : 2.58 Millions
S1E6 : 2.44 Millions
S1E7 : 2.4 Millions
S1E8 : 2.72 Millions
S1E9 : 2.66 Millions
S1E10 : 3.04 Millions
S2E1 : 3.86 Millions
S2E2 : 3.76 Millions
S2E3 : 3.77 Millions
S2E4 : 3.65 Millions
S2E5 : 3.9 Millions
S2E6 : 3.88 Millions
S2E7 : 3.69 Millions
S2E8 : 3.86 Millions
S2E9 : 3.38 Millions
S2E10 : 4.2 Millions
S3E1 : 4.37 Millions
S3E2 : 4.27 Millions
S3E3 : 4.72 Millions
S3E4 : 4.87 Millions
S3E5 : 5.35 Millions
S3E6 : 5.5 Millions
S3E7 : 4.84 Millions
S3E8 : 5.13 Millions
S3E9 : 5.22 Millions
S3E10 : 5.39 Millions
S4E1 : 6.64 Millions
S4E2 : 6.31 Millions
S4E3 : 6.59 Millions
S4E4 : 6.95 Millions
S4E5 : 7.16 Millions
S4E6 : 6.4 Millions
S4E7 : 7.2 Millions
S4E8 : 7.17 Millions
S4E9 : 6.95 Millions
S4E10 : 7.09 Millions
S5E1 : 8.0 Millions
S5E2 : 6.81 Millions
S5E3 : 6.71 Millions
S5E4 : 6.82 Millions
S5E5 : 6.56 Millions
S5E6 : 6.24 Millions
S5E7 : 5.4 Millions
S5E8 : 7.01 Millions
S5E9 : 7.14 Millions
S5E10 : 8.11 Millions
S6E1 : 7.94 Millions
S6E2 : 7.29 Millions
S6E3 : 7.28 Millions
S6E4 : 7.82 Millions
S6E5 : 7.89 Millions
S6E6 : 6.71 Millions
S6E7 : 7.8 Millions
S6E8 : 7.6 Millions
S6E9 : 7.66 Millions
S6E10 : 8.89 Millions
S7E1 : 10.11 Millions
S7E2 : 9.27 Millions
S7E3 : 9.25 Millions
S7E4 : 10.17 Millions
S7E5 : 10.72 Millions
S7E6 : 10.24 Millions
S7E7 : 12.07 Millions
The total number of views is 398.73 millions
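
The string handling described in this answer could also be pulled into small helpers. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above. The names `parse_views`, `episode_label` and `episode_line` are hypothetical and not part of the original answer, and the citation pattern is an assumption about how the viewer cells look (a number followed by markers such as "[12]").

```python
import re

# Assumption: citation markers look like "[12]" or "[a]" and follow the number.
CITATION = re.compile(r'\[[^\]]*\]')


def parse_views(cell_text):
    """Turn a viewers cell such as '2.22[12]' into the float 2.22."""
    return float(CITATION.sub('', cell_text).strip())


def episode_label(season_num, episode_num):
    """Build a label such as 'S01E1' (season zero-padded to two digits)."""
    return 'S{:02d}E{}'.format(season_num, episode_num)


def episode_line(season_num, episode_num, cell_text):
    """Format one output line for an episode, with two decimal places."""
    views = parse_views(cell_text)
    return '{}: {:.2f} million'.format(episode_label(season_num, episode_num), views)


# Hypothetical cell texts, used only to show the formatting.
print(episode_line(1, 1, '2.22[12]'))
print(episode_line(1, 2, '2.20[13]'))
```

Formatting with `{:.2f}` keeps trailing zeros, so 2.2 is printed as 2.20, which is closer to the layout asked for in the question than `str(views)`.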

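Answer 1 (score: 0)

You can do it the way Ali showed, except that instead of only summing the values you print each one and also add it to a separate per-season variable, which in my case is:

totalViewsPerSeason

Working solution: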

import re
import urllib

from BeautifulSoup import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
wiki_html = urllib.urlopen(wiki_url).read()
wiki_content = BeautifulSoup(wiki_html)

seasons_table = wiki_content.find('table', attrs={'class': 'wikitable'})
seasons = seasons_table.findAll('a', attrs={'href': re.compile('\/wiki\/Game_of_Thrones_\(season_?[0-9]+\)')})

views = 0
grandTotalViews = 0
season_num = 1

for season in seasons:
    season_url = 'https://en.wikipedia.org' + season['href']
    season_html = urllib.urlopen(season_url).read()
    season_content = BeautifulSoup(season_html)

    episodes_table = season_content.find('table', attrs={'class': 'wikitable plainrowheaders wikiepisodetable'})

    if episodes_table:
        episode_rows = episodes_table.findAll('tr', attrs={'class': 'vevent'})

        if episode_rows:
            episode_num = 1
            totalViewsPerSeason = 0
            for episode_row in episode_rows:
                episode_views = episode_row.findAll('td')[-1]

                # strip citation markers such as "[12]" from the cell text, then convert the rest to a float
                views = float(re.sub(r'\[?[0-9]+\]', '', episode_views.text))
                grandTotalViews += views
                totalViewsPerSeason += views
                print 'S' + str(season_num) + "E" + str(episode_num) + " : " + str(views) + " Millions"
                episode_num += 1

            # printed inside the "if" so totalViewsPerSeason is always defined and belongs to this season
            print "Total season " + str(season_num) + " views: " + str(totalViewsPerSeason) + " Millions\n"
    season_num += 1

print 'The total number of views is ' + str(grandTotalViews) + ' millions'
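
The per-season accumulation can be tried without any network access. This is a rough sketch in Python 3, assuming the episode views have already been scraped into a dictionary. The function name `season_totals` and the `sample` data structure are hypothetical, and the figures are the season 1 and season 2 values copied from the output shown in answer 0.

```python
def season_totals(views_by_season):
    """Return ({season: total}, grand_total), each rounded to two decimals."""
    totals = {}
    for season_num, episode_views in sorted(views_by_season.items()):
        # round() hides float noise such as 25.150000000000002
        totals[season_num] = round(sum(episode_views), 2)
    grand_total = round(sum(totals.values()), 2)
    return totals, grand_total


# Views in millions, taken from the output listed in answer 0.
sample = {
    1: [2.22, 2.2, 2.44, 2.45, 2.58, 2.44, 2.4, 2.72, 2.66, 3.04],
    2: [3.86, 3.76, 3.77, 3.65, 3.9, 3.88, 3.69, 3.86, 3.38, 4.2],
}

totals, grand_total = season_totals(sample)
for season_num, total in sorted(totals.items()):
    print('Total season {} views: {:.2f} million'.format(season_num, total))
print('The total number of views is {:.2f} million'.format(grand_total))
```

Keeping the totals in a dictionary keyed by season number means a season whose table could not be found simply has no entry, instead of reusing the previous season's total.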