Scraping data from an HTML table, selecting rows with specific attributes

Time: 2018-04-04 12:33:24

Tags: python beautifulsoup

I am scraping information from the page http://www.mobygames.com/game/wheelman/view-moby-score and this is my code so far:

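import requests
from bs4 import BeautifulSoup

# Assumed placeholder: the original post does not show how headers was defined.
headers = {"User-Agent": "Mozilla/5.0"}

url_credit = "http://www.mobygames.com/game/wheelman/view-moby-score"
response = requests.get(url_credit, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find("table", class_="reviewList table table-striped table-condensed table-hover").select('tr[valign="top"]')
for row in table[1:]:
    print(row)
    # This line fails: select() returns a list of tags, which has no .get()
    x = soup.select('td[class="left"]').get("colspan")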

The output I want looks like this:

platform     total_votes rating_category score  total_score
PlayStation3 None        None            None   None
Windows      6           Acting          4.2    4.1
Windows      6           AI              3.7    4.1
Windows      6           Gameplay        4.0    4.1

The main problem is filling the platform column with the platform name that each observation belongs to. How can I get it?

1 answer:

Answer 0 (score: 1)

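Notice that a row that introduces a new platform has 3 columns, while every other row has 2. You can use this difference to detect when the platform changes.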

Notice also that rows such as the PlayStation one contain a <td> cell with colspan="2" class="center". Use that to handle cases like PlayStation, where the platform has no votes.

Code:

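import requests
from bs4 import BeautifulSoup

# Assumed placeholder: the original post does not show how headers was defined.
headers = {"User-Agent": "Mozilla/5.0"}

url_credit = "http://www.mobygames.com/game/wheelman/view-moby-score"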
response = requests.get(url_credit, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find("table", class_="reviewList table table-striped table-condensed table-hover").select('tr[valign="top"]')

platform = ''
total_votes, total_score = None, None
for row in table[1:]:
    # handle cases like playstation
    if row.find('td', colspan='2', class_='center'):
        platform = row.find('td').text
        total_score, total_votes = None, None
        print('{} | {} | {} | {} | {}'.format(platform, total_votes, None, None, total_score))
        continue

    cols = row.find_all('td')
    if len(cols) == 3:
        platform = cols[0].text
        total_votes = cols[1].text
        total_score = cols[2].text
        continue
    print('{} | {} | {} | {} | {}'.format(platform, total_votes, cols[0].text, cols[1].text, total_score))

Output:

PlayStation 3 | None | None | None | None
Windows | 6 |       Acting | 4.2 | 4.1
Windows | 6 |       AI | 3.7 | 4.1
Windows | 6 |       Gameplay | 4.0 | 4.1
Windows | 6 |       Graphics | 4.2 | 4.1
Windows | 6 |       Personal Slant | 4.3 | 4.1
Windows | 6 |       Sound / Music | 4.3 | 4.1
Windows | 6 |       Story / Presentation | 3.8 | 4.1
Xbox 360 | 5 |       Acting | 3.8 | 3.5
Xbox 360 | 5 |       AI | 3.2 | 3.5
Xbox 360 | 5 |       Gameplay | 3.4 | 3.5
Xbox 360 | 5 |       Graphics | 3.6 | 3.5
Xbox 360 | 5 |       Personal Slant | 3.6 | 3.5
Xbox 360 | 5 |       Sound / Music | 3.4 | 3.5
Xbox 360 | 5 |       Story / Presentation | 3.8 | 3.5

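Note: wherever I call print(), I mean that you should save those values in whatever list or DataFrame you are using. I only use print() to show how the platform variable is updated when needed.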
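
To make that note concrete, here is a minimal sketch of collecting the rows into a list of dicts instead of printing them. It deliberately leaves out the network and parsing step so it can run on its own: Row and collect_records are hypothetical names that are not part of the original answer, and the sample values are copied from the output above. In the real loop, I would expect each Row to be built from something like [td.get_text(strip=True) for td in row.find_all('td')], with unrated set when the row contains the colspan="2" class="center" cell.

```python
from typing import Dict, List, NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical container for one table row after its <td> text is extracted."""
    cells: List[str]
    unrated: bool = False  # True when the row has a <td colspan="2" class="center">


def collect_records(rows: List[Row]) -> List[Dict[str, Optional[str]]]:
    """Carry the current platform forward and return one dict per output line."""
    records = []
    platform = total_votes = total_score = None
    for row in rows:
        if row.unrated:
            # Platform without any votes: emit one record with empty fields.
            platform, total_votes, total_score = row.cells[0], None, None
            category, score = None, None
        elif len(row.cells) == 3:
            # Platform header row: remember its values, emit nothing yet.
            platform, total_votes, total_score = row.cells
            continue
        else:
            # Category row: two cells, belonging to the last platform seen.
            category, score = row.cells
        records.append({
            "platform": platform,
            "total_votes": total_votes,
            "rating_category": category,
            "score": score,
            "total_score": total_score,
        })
    return records


# Sample rows using values taken from the output shown above.
sample = [
    Row(["PlayStation 3"], unrated=True),
    Row(["Windows", "6", "4.1"]),
    Row(["Acting", "4.2"]),
    Row(["AI", "3.7"]),
    Row(["Xbox 360", "5", "3.5"]),
    Row(["Acting", "3.8"]),
]

records = collect_records(sample)
for record in records:
    print(record)
```

If pandas is installed, passing the resulting list to pandas.DataFrame(records) should give a table with the five columns from the question.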