Question

I'm trying to scrape https://en.wikipedia.org/wiki/FIFA_World_Rankings and grab the first table on the page, but it isn't working: I get the error 'NoneType' object is not callable.

Here is my code:

from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen("https://en.wikipedia.org/wiki/FIFA_World_Rankings").read())

for row in soup('table', {'class': 'wikitable'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

I don't know much about HTML, and I know very little about web scraping.

Answer 1

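You're missing the findAll (or find_all, if you want to be Pythonic) function to search for all tags under an element.

You'll probably also want to check the data so that you don't get an IndexError, like so: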

for row in soup('table', {'class': 'wikitable'})[0].findAll('tr'):
    tds = row.findAll('td')
    if len(tds) > 1:
        print tds[0].text, tds[1].text

This is the output it gives:

 Argentina 1532
 Belgium 1352
 Chile 1348
 Colombia 1337
 Germany 1309
 Spain 1277
 Brazil 1261
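
A note on where the error most likely comes from, as a minimal sketch rather than a definitive diagnosis. In BeautifulSoup, attribute access such as table.tbody is shorthand for table.find('tbody'), which returns None when the markup has no <tbody> element, and calling None raises the reported error. The sketch below is written for Python 3, and SAMPLE_HTML is a made-up stand-in for the Wikipedia table so that nothing is fetched over the network.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (assumption: no <tbody> in the source).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Argentina</td><td>1532</td></tr>
  <tr><td>Belgium</td><td>1352</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# html.parser does not insert a <tbody>, so the attribute lookup finds nothing.
print(table.tbody)  # None

# Calling None is what produces the error from the question.
try:
    table.tbody("tr")
except TypeError as exc:
    print(exc)  # 'NoneType' object is not callable

# Searching from the table itself works whether or not a <tbody> exists.
pairs = []
for row in table.find_all("tr"):
    tds = row.find_all("td")
    if len(tds) > 1:  # header rows contain only <th>, so they are skipped
        pairs.append((tds[0].get_text(strip=True), tds[1].get_text(strip=True)))

print(pairs)
```

Whether a <tbody> shows up depends on the page source and on the parser in use, so searching from the table element is the safer choice.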

Answer 2

import requests
from bs4 import BeautifulSoup

request = requests.get("https://en.wikipedia.org/wiki/FIFA_World_Rankings")
sourceCode = BeautifulSoup(request.content)
tables = sourceCode.select('table.wikitable')
table = tables[0]

print table.get_text()

If you want the result as a list:

# Named cell_texts to avoid shadowing the built-in list type
cell_texts = list(table.stripped_strings)
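
One thing to keep in mind is that stripped_strings on the whole table gives a single flat list, so the row boundaries are lost. Here is a small sketch (Python 3) of one way to keep them, by collecting the strings per <tr> instead. SAMPLE_HTML and the variable names are made up for illustration and stand in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the first wikitable on the page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Argentina</td><td>1532</td></tr>
  <tr><td>Belgium</td><td>1352</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.select("table.wikitable")[0]

# Flat list: every non-empty string in the table, whitespace stripped.
flat = list(table.stripped_strings)

# Nested list: one inner list per table row.
rows = [list(tr.stripped_strings) for tr in table.select("tr")]

print(flat)
print(rows)
```

The nested form is usually easier to work with if you later want to write the data to CSV or pick out individual columns.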

Answer 3

This should work. You need to use find_all to find the tags. Also, in the Wiki article the team rankings appear in table rows 3-22, hence the if condition.

from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen("https://en.wikipedia.org/wiki/FIFA_World_Rankings").read())

for i, row in enumerate(soup('table', {'class': 'wikitable'})[0].find_all('tr')):
    # Rows 3-22 hold the team rankings; earlier rows are headers
    if 2 < i < 23:
        data = row.find_all('td')
        print i, data[0].text, data[1].text
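
For anyone on Python 3, the same index-range idea can be sketched as a small helper. This is only an illustration: rows_in_range and SAMPLE_HTML are hypothetical names and data, and the range 2 to 4 fits the sample table here, not the 3 to 22 of the real article.

```python
from bs4 import BeautifulSoup

# Hypothetical table with two header rows and a trailing footnote row.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th colspan="2">Top teams</th></tr>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Argentina</td><td>1532</td></tr>
  <tr><td>Belgium</td><td>1352</td></tr>
  <tr><td>Chile</td><td>1348</td></tr>
  <tr><td colspan="2">Footnote row</td></tr>
</table>
"""


def rows_in_range(table, first, last):
    """Return (index, team, points) for rows with first <= index <= last."""
    result = []
    for i, row in enumerate(table.find_all("tr")):
        if first <= i <= last:
            cells = row.find_all("td")
            result.append(
                (i, cells[0].get_text(strip=True), cells[1].get_text(strip=True))
            )
    return result


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable")
selected = rows_in_range(table, 2, 4)

for index, team, points in selected:
    print(index, team, points)
```

Hard-coded row numbers break as soon as the page layout changes, so checking the number of <td> cells per row is a sturdier filter if the table is likely to be edited.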

Web Scrape in Python
