Question

I am trying to extract the game names from the links on this Metacritic page. Here is the code I have so far:

from requests import get
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

url = "http://www.metacritic.com/browse/games/score/metascore/year/pc/filtered?sort=desc&year_selected=2018"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

html_soup = BeautifulSoup(webpage, 'lxml')

game_name = html_soup.find_all("div", class_="product_item product_title")

print(game_name)

This prints the following (along with all the other entries I want):

<div class="product_item product_title">
    <a href="/game/pc/into-the-breach">
                        Into the Breach
                                        </a>
    </div>,

and so on ....

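I would like to know how to target just one of them and get only the name (I am trying to create variables that hold each name as a string).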

Also, how would I target the second and third ones? I tried [1] and [2] but got an error (see below). Maybe I am doing something wrong?

When I try the .find() method:

game_name = html_soup.find("div", class_="product_item product_title").text

I get the text, but it is not clean (it has extra spaces and newlines):

            Into the Breach

[edit] I used strip() and that cleaned up the text.

But when I try

game_name = html_soup.find("div", class_="product_item product_title")[1].text

I get this error:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-bd2752bc8407> in <module>()
      7 html_soup = BeautifulSoup(webpage, 'lxml')
      8 
----> 9 game_name = html_soup.find("div", class_={"product_item product_title"})[1].text
     10 
     11 print(game_name)

~/anaconda3/lib/python3.6/site-packages/bs4/element.py in __getitem__(self, key)
   1009         """tag[key] returns the value of the 'key' attribute for the tag,
   1010         and throws an exception if it's not there."""
-> 1011         return self.attrs[key]
   1012 
   1013     def __iter__(self):

KeyError: 1

Please help, I am very new to this.

Answer 1

You want to use find_all() instead of find().

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find

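The only difference is that find_all() returns a list of every match, while find() returns only the first match as a single Tag. That is why [1] fails on the result of find(): indexing a Tag looks up an attribute with that key (the self.attrs[key] line in your traceback), and there is no attribute named 1, hence KeyError: 1.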
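A minimal sketch of that difference, assuming the page markup looks like the snippet in the question. To keep it runnable without a network request, it parses a small inline sample instead of the live page; sample_html and the second title ("Example Game") are made up for illustration, and html.parser is swapped in for lxml so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: two entries shaped like the markup
# in the question. The second entry is hypothetical.
sample_html = """
<div class="product_item product_title">
    <a href="/game/pc/into-the-breach">
                        Into the Breach
                                        </a>
    </div>
<div class="product_item product_title">
    <a href="/game/pc/example-game">
                        Example Game
                                        </a>
    </div>
"""

html_soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns a list-like ResultSet, so integer indexing works on it.
title_divs = html_soup.find_all("div", class_="product_item product_title")

# get_text(strip=True) drops the surrounding spaces and newlines.
game_names = [div.get_text(strip=True) for div in title_divs]

first_name = game_names[0]
second_name = game_names[1]
print(first_name)
print(second_name)

# find() returns a single Tag. Square brackets on a Tag read an attribute,
# so a string key such as "class" works while an integer key raises KeyError.
first_div = html_soup.find("div", class_="product_item product_title")
try:
    first_div[1]
    tag_index_failed = False
except KeyError:
    tag_index_failed = True
```

With the real page you would keep your existing Request/urlopen code and only change the last lines: call find_all(), then index into the resulting list or loop over it.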

