Scraping data from an HTML table, selecting the elements between headings

Date: 2018-04-03 15:15:50

Tags: python beautifulsoup

I am trying to scrape information from the following URL with the code below: http://www.mobygames.com/game/xbox360/wheelman/credits

# Imports
import requests
from bs4 import BeautifulSoup
credit_link = "http://www.mobygames.com/game/xbox360/wheelman/credits"
response = requests.get(credit_link)
soup = BeautifulSoup(response.text, "lxml")
credit_infor = soup.find("div", class_="col-md-8 col-lg-8")
credit_infor1 = credit_infor.select('table[summary="List of Credits"]')[0].find_all('tr')

This is the format I need:

info          credit_to  studio                   game       console
starring      138920     starring                 Wheelman   Xbox 360
Studio Heads  151851     Midway Newcastle Studio  Wheelman   Xbox 360
Studio Heads  73709      Midway Newcastle Studio  Wheelman   Xbox 360

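Here info corresponds to the first "td" in each row, credit_to is the id of a particular contributor (for example, 138920 is Vin Diesel's id), and starring corresponds to the heading. I think I can handle everything except getting the studio name (that is, the heading) next to each row, since it later switches from Midway Newcastle Studio to San Diego QA Team, and so on. How can I do this?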

1 answer:

Answer 0 (score: 1)

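With your code, credit_infor1 holds a list of all the tr tags (rows). If you inspect the HTML, the rows that contain a heading (the studio) have no class attribute, while all other rows have class="crln".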

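So you can iterate over all the rows and use the has_attr() function (somewhat hidden in the documentation) to check whether the current row has a class attribute. If the attribute is missing, the row is a heading, so update the current studio; otherwise, go on to scrape the other data from that row.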
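To see the check in isolation, here is a minimal sketch run against a small hand-written table. The markup is an assumption modelled on the description above (heading rows without a class, data rows with class="crln"), not a copy of the real page, and html.parser is used only so the snippet has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: heading rows have no class attribute,
# data rows carry class="crln" (as described in the answer).
html = """
<table summary="List of Credits">
  <tr><td colspan="2"><h2>Starring</h2></td></tr>
  <tr class="crln"><td>Starring</td><td>Vin Diesel</td></tr>
  <tr><td colspan="2"><h2>Midway Newcastle Studio</h2></td></tr>
  <tr class="crln"><td>Studio Heads</td><td>(names)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# has_attr() returns True when the tag carries the given attribute.
flags = [row.has_attr("class") for row in rows]
print(flags)
```

Rows for which the flag is False are the heading rows, so they are the ones that should update the current studio.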

Continuing from your code:

studio = ''
for row in credit_infor1:
    if not row.has_attr('class'):
        studio = row.h2.text
        continue

    # get other values that you want from this row below

    info = row.find('td').text
    # similarly get all the other values you need each time

    print(info + ' | ' + studio)

Partial output:

Starring | Starring
Studio Heads | Midway Newcastle Studio
Executive Producers | Midway Newcastle Studio
Technical Directors | Midway Newcastle Studio
Lead Programmers | Midway Newcastle Studio
...
QA Manager | San Diego QA Team
Compliance QA Manager | San Diego QA Team
QA Data Analyst | San Diego QA Team
...
SQA Analyst | SQS India QA
QA Team | SQS India QA
Executive Producers | Tigon Studios
Head of Game Production | Tigon Studios
...
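
The same loop can be extended to build the full records asked for in the question. The following is only a sketch under stated assumptions: the sample markup is hand-written, the contributor names are placeholders, and the link format (an href containing "developerId,<number>") is a guess at how the contributor id appears on the page, so the regular expression may need adjusting. The function name parse_credits and its parameters are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the href format is an assumption.
SAMPLE_HTML = """
<table summary="List of Credits">
  <tr><td colspan="2"><h2>Starring</h2></td></tr>
  <tr class="crln"><td>Starring</td>
      <td><a href="/developer/sheet/view/developerId,138920/">Vin Diesel</a></td></tr>
  <tr><td colspan="2"><h2>Midway Newcastle Studio</h2></td></tr>
  <tr class="crln"><td>Studio Heads</td>
      <td><a href="/developer/sheet/view/developerId,151851/">Person A</a>,
          <a href="/developer/sheet/view/developerId,73709/">Person B</a></td></tr>
</table>
"""

def parse_credits(html, game, console):
    """Return one record per credited contributor, tagged with its studio."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    studio = ""
    for row in soup.find_all("tr"):
        # Heading rows have no class attribute: remember the studio name.
        if not row.has_attr("class"):
            heading = row.find("h2")
            if heading is not None:
                studio = heading.get_text(strip=True)
            continue
        cells = row.find_all("td")
        if not cells:
            continue
        info = cells[0].get_text(strip=True)
        # One record per contributor link found in the row.
        for link in row.find_all("a", href=True):
            match = re.search(r"developerId,(\d+)", link["href"])
            if match:
                records.append({
                    "info": info,
                    "credit_to": match.group(1),
                    "studio": studio,
                    "game": game,
                    "console": console,
                })
    return records

records = parse_credits(SAMPLE_HTML, "Wheelman", "Xbox 360")
for record in records:
    print(record)
```

A row that credits several people yields one record per person, which matches the two "Studio Heads" lines in the requested format. The resulting list of dicts can be passed to csv.DictWriter or pandas.DataFrame if a tabular file is needed.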