Question

I am working on a web scraping project, and I have the following item that I am looking to scrape:

<td class="country">
  <div>
    <img alt="Niger" height="27" src="http://assets.rio2016.nbcolympics.com/country-flags/52x35/NIG.png" width="40"/>
    Niger                                          
  </div>

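In this case, I am trying to scrape "Niger" out of the list. I have an entire table, and I am trying to pull all the countries out of it. My current code looks like this: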

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.nbcolympics.com/medals')
soup = BeautifulSoup(response.content, 'lxml')
for td in soup.findAll("td", {"class": "country"}):
    print(td)

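This gets me a lot more information than I need. I just want the country value in the table. (The table contains all the countries that participated in the Olympics.) If I try to do something like this: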

for td in soup.findAll("td",{"class": "country"}).children:

I get the following error message:

Traceback (most recent call last):
File "idea.py", line 15, in <module>
  for row in soup.find_all('tr').children:
AttributeError: 'ResultSet' object has no attribute 'children'

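I know there has to be a way for me to step through these elements to get to the country value. (I can use get_text() to get the country, but it comes with a lot of extra information.) Also, if the div had a class, I think this would be fairly easy. Thanks for your help.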

I have also tried:

for td in soup.findAll("img", {"width": "40"}):
    print(td)

This almost gets me what I want. It prints the following:

<img alt="Togo" height="27" src="http://assets.rio2016.nbcolympics.com/country-flags/52x35/TOG.png" width="40"/>

However, I don't get the country name that comes after it! But I'm so close!

Answer 1

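findAll returns a ResultSet of the found elements, which is an iterable; the ResultSet itself has no .children attribute, which is why you get the AttributeError. You need to iterate over the found elements and read the value you want from each one, here the alt attribute of the image: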

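for element in soup.select("td.country img"):
    # Each match is an <img> inside a country cell; its alt holds the country name.
    print(element.get('alt', ''))

I have replaced the "td", {"class": "country"} selector with the CSS selector "td.country img", since you are looking for the images inside the cells with class country. The class is on the td, not on the img, so filtering img tags by class country would match nothing.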
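
If you would rather read the visible text of the cell than the alt attribute, get_text(strip=True) on each td should drop the surrounding whitespace. Below is a minimal sketch of both approaches. It runs against a small inline HTML sample modeled on the snippets in the question instead of the live page, and it uses the built-in html.parser so that lxml is not required. The closing tags, the second row's text, and the helper names country_names and country_alts are my own assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question's snippets (closing tags are assumed).
HTML = """
<table>
  <tr>
    <td class="country">
      <div>
        <img alt="Niger" height="27" src="http://assets.rio2016.nbcolympics.com/country-flags/52x35/NIG.png" width="40"/>
        Niger
      </div>
    </td>
  </tr>
  <tr>
    <td class="country">
      <div>
        <img alt="Togo" height="27" src="http://assets.rio2016.nbcolympics.com/country-flags/52x35/TOG.png" width="40"/>
        Togo
      </div>
    </td>
  </tr>
</table>
"""


def country_names(html):
    """Return the stripped visible text of every td with class "country"."""
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td", class_="country")]


def country_alts(html):
    """Return the alt attribute of every img inside a td with class "country"."""
    soup = BeautifulSoup(html, "html.parser")
    return [img.get("alt", "") for img in soup.select("td.country img")]


if __name__ == "__main__":
    print(country_names(HTML))
    print(country_alts(HTML))
```

On the real page you would pass response.content instead of the sample string. The two approaches only agree as long as the alt text matches the visible name, so it is worth checking a few rows by hand.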

Beautiful Soup issue with webscraping
