Getting the wrong text when scraping a web page with BeautifulSoup

Date: 2018-06-18 13:11:30

Tags: html python-3.x beautifulsoup

When I scrape this URL, I get the wrong text:

http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?view=detailed&sort=desc&year_selected=2018

This is what I have:

from requests import get
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd

#Define year
year_number = 2018

# Define the URL
i = range(0, 1)

names = []
metascores = []
userscores = []
userscoresNew = []
release_dates = []
release_datesNew = []
publishers = []
ratings = []
genres = []

for element in i:

    url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?view=detailed&sort=desc&year_selected=" + format(year_number)

    print(url)

    year_number -= 1

    # not sure about this but it works (I was getting blocked by something and this the way I found around it)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    web_byte = urlopen(req).read()

    webpage = web_byte.decode('utf-8')

    #this grabs the all the text from the page
    html_soup = BeautifulSoup(webpage, 'html5lib')

    #this is for selecting all the games in from 1 to 100 (the list of them)
    game_names = html_soup.find_all("div", class_="main_stats")
    game_metas = html_soup.find_all("a", class_="basic_stat product_score")  
    game_users = html_soup.find_all("li", class_='stat product_avguserscore')
    game_releases = html_soup.find_all("ul", class_='more_stats')
#     game_publishers = html_soup.find_all("ul", class_='more_stats')
#     game_ratings = html_soup.find_all("ul", class_='more_stats')
#     game_genres = html_soup.find_all("ul", class_='more_stats')



    #Extract data from each game
    for games in game_names:
        name = games.find()
        names.append(name.text.strip())

    for games2 in game_metas:
        metascore = games2.find()
        metascores.append(metascore.text.strip())  

    for games3 in game_releases:
        release_date = games3.find()
        release_dates.append(release_date.text.strip())

    for games4 in game_users:
        game_user = games4.find()
        userscores.append(game_user.text.strip())


#         print(name)
#         print(metascore)
#         print(userscore)

# for i in userscores:
#     temp = str(i)
#     temp2 = temp.replace("User:\n    ", "")
#     userscoresNew.append(temp2)

for x in release_dates:
    temp = str(x)
    temp2 = temp.replace("Release Date:\n                        ", "")
    release_datesNew.append(temp2)


# df = pd.DataFrame({'Games:': names,
#                     'Metascore:': metascores,
#                     'Userscore:': userscoresNew}) 

# df.to_csv("metacritic scrape.csv")

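The code above looks for the user score, but I just get the "User:" label text repeated 100 times, when what I want is the data in the next set of tags. When I try changing the variable above to: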

 game_users = html_soup.find_all("span", class_='data textscore textscore_favorable')

I get an error when running the code:

AttributeError: 'NoneType' object has no attribute 'text'

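I also don't think the second option is a good approach, because the class changes in the HTML when the user score drops below a certain level (from "data textscore textscore_favorable" to "data textscore textscore_mixed").

Any help would be appreciated.

FYI, I am modifying code I had already written, this time to pull more details from the more detailed view.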

1 answer:

Answer 0: (score: 1)

This should help:

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?view=detailed&sort=desc&year_selected=2018"
html = requests.get(url, headers=headers)
html_soup = BeautifulSoup(html.text, "html.parser")
game_users = html_soup.find_all("li", class_='stat product_avguserscore')
for i in game_users:
    userScore = i.find('span', class_="data textscore textscore_favorable")
    if userScore:
        print(userScore.text)
This prints the scores: 7.6 7.8 8.2 7.8 8.1 8.5 7.5 7.5 ....
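
Note that the loop above only matches spans whose class is exactly "data textscore textscore_favorable", so items with "textscore_mixed" are silently skipped, which is the concern raised in the question. A minimal sketch of a class-agnostic variant follows. It assumes every score span carries the shared `textscore` class, and it runs against a small hypothetical HTML sample modelled on the class names quoted in the question rather than the live page, whose markup may differ. The helper name `extract_user_scores` is made up for illustration. The sketch also shows why the original attempts went wrong: the first tag inside each `li` is the label span, and the score span itself has no child tags, so asking it for one returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the class names quoted in the question;
# the real Metacritic page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li class="stat product_avguserscore">
    <span class="label">User:</span>
    <span class="data textscore textscore_favorable">7.6</span>
  </li>
  <li class="stat product_avguserscore">
    <span class="label">User:</span>
    <span class="data textscore textscore_mixed">6.4</span>
  </li>
</ul>
"""


def extract_user_scores(html):
    """Return the user score text for each list item, or None if missing."""
    soup = BeautifulSoup(html, "html.parser")
    scores = []
    for item in soup.find_all("li", class_="stat product_avguserscore"):
        # A single class name matches any tag that has it among its classes,
        # so this finds both the "favorable" and the "mixed" spans.
        span = item.find("span", class_="textscore")
        scores.append(span.get_text(strip=True) if span else None)
    return scores


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_item = soup.find("li", class_="stat product_avguserscore")

# find(True) returns the first descendant tag; here that is the label span,
# which would explain the repeated "User:" text.
label = first_item.find(True)
print(label.get_text(strip=True))

# The score span contains only text, so there is no child tag to find.
# Calling .text on this None result raises the AttributeError.
score_span = first_item.find("span", class_="textscore")
print(score_span.find(True))

user_scores = extract_user_scores(SAMPLE_HTML)
print(user_scores)  # ['7.6', '6.4']
```

Appending `None` when a span is missing keeps the list the same length as the other per-game lists, which matters if they are later combined into a DataFrame.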