I am trying to scrape information from the following URL using this code: http://www.mobygames.com/game/xbox360/wheelman/credits
# Imports
import requests
from bs4 import BeautifulSoup
credit_link = "http://www.mobygames.com/game/xbox360/wheelman/credits"
response = requests.get(credit_link)
soup = BeautifulSoup(response.text, "lxml")
credit_infor = soup.find("div", class_="col-md-8 col-lg-8")
credit_infor1 = credit_infor.select('table[summary="List of Credits"]')[0].find_all('tr')
This is the format I need:
info          credit_to  studio                   game      console
starring      138920     starring                 Wheelman  Xbox 360
Studio Heads  151851     Midway Newcastle Studio  Wheelman  Xbox 360
Studio Heads  73709      Midway Newcastle Studio  Wheelman  Xbox 360
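Here info corresponds to the first "td" in each row, credit_to corresponds to the ID of a particular contributor (for example, 138920 is the ID for Vin Diesel), and the studio value (for example, starring) corresponds to the section heading. I think I can handle everything except getting the studio name (i.e. the heading) next to each row (it will later switch from Midway Newcastle Studio to San Diego QA Team, and so on). How can I do that?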
Answer 0 (score: 1)
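In your program, credit_infor1 will contain a list of all the tr tags (rows). If you inspect the HTML, the rows that contain a heading (the studio) have no class attribute. All the other rows have the class="crln" attribute.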
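So you can iterate over all the rows and use the has_attr() function (it is somewhat hidden in the documentation) to check whether the current row has class as an attribute. If the attribute is absent, the row is a heading, so update the current studio; otherwise, go on to scrape the other data from the row.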
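As a quick illustration of that check, here is a minimal sketch run against a made-up fragment rather than the live page. The markup and the name "Some Person" are assumptions that only mimic the structure described above (a heading row without class, a credit row with class="crln"), and "html.parser" is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above:
# a heading row without a class, followed by a credit row with class="crln".
html = """
<table>
  <tr><td><h2>Midway Newcastle Studio</h2></td></tr>
  <tr class="crln"><td>Studio Heads</td><td>Some Person</td></tr>
</table>
"""

rows = BeautifulSoup(html, "html.parser").find_all("tr")

# has_attr() returns True only when the tag carries the given attribute.
flags = [row.has_attr("class") for row in rows]
print(flags)  # [False, True]
```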
Continuing your program:
studio = ''
for row in credit_infor1:
    # Heading rows have no class attribute; remember the studio and move on.
    if not row.has_attr('class'):
        studio = row.h2.text
        continue

    # get other values that you want from this row below
    info = row.find('td').text

    # similarly get all the other values you need each time

    print(info + ' | ' + studio)
Partial output:
Starring | Starring
Studio Heads | Midway Newcastle Studio
Executive Producers | Midway Newcastle Studio
Technical Directors | Midway Newcastle Studio
Lead Programmers | Midway Newcastle Studio
...
QA Manager | San Diego QA Team
Compliance QA Manager | San Diego QA Team
QA Data Analyst | San Diego QA Team
...
SQA Analyst | SQS India QA
QA Team | SQS India QA
Executive Producers | Tigon Studios
Head of Game Production | Tigon Studios
...
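To get from this output to the five-column format in the question, the same loop can also collect the contributor IDs. The following is only a sketch under stated assumptions: the sample markup is made up, "Person A" and "Person B" are placeholder names, and the link pattern "developerId,<number>" is an assumption about how the page encodes contributor IDs, so check it against the real HTML. The helper name collect_credits is hypothetical as well.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample; the href pattern "developerId,<number>" is an assumption.
SAMPLE_HTML = """
<table summary="List of Credits">
  <tr><td colspan="2"><h2>Starring</h2></td></tr>
  <tr class="crln"><td>Starring</td>
      <td><a href="/developer/sheet/view/developerId,138920/">Vin Diesel</a></td></tr>
  <tr><td colspan="2"><h2>Midway Newcastle Studio</h2></td></tr>
  <tr class="crln"><td>Studio Heads</td>
      <td><a href="/developer/sheet/view/developerId,151851/">Person A</a>,
          <a href="/developer/sheet/view/developerId,73709/">Person B</a></td></tr>
</table>
"""

DEV_ID = re.compile(r"developerId,(\d+)")


def collect_credits(rows, game, console):
    """Return (info, credit_to, studio, game, console) tuples, one per contributor."""
    records = []
    studio = ""
    for row in rows:
        # Heading rows carry no class attribute: remember the studio name.
        if not row.has_attr("class"):
            if row.h2 is not None:
                studio = row.h2.get_text(strip=True)
            continue
        cells = row.find_all("td")
        if not cells:
            continue
        info = cells[0].get_text(strip=True)
        # One record per contributor link found in the row.
        for link in row.find_all("a", href=True):
            match = DEV_ID.search(link["href"])
            if match:
                records.append((info, match.group(1), studio, game, console))
    return records


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.select('table[summary="List of Credits"]')[0].find_all("tr")
records = collect_credits(rows, "Wheelman", "Xbox 360")
for record in records:
    print(" | ".join(record))
```

Here the game and console values are passed in by hand; on the real page they would have to be taken from wherever the page states them.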