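I am trying to scrape data from this website, which lists game credits broken down by category. I want to end up with 24 categories in total, one per column (24 columns). The example page has 5 of them (including Production, Design, Engineering and Thanks).
This would be easy if each category had its own class, but they all share the same h3 class: "clean". Different pages have different categories, and the order of the categories also changes from page to page. On top of that, the information I need is actually in the next row of the table, under a different class.
So my idea was to write 24 if statements, one per category, that check whether any h3 with class "clean" contains that category; if it does, I scrape the row I need, otherwise I leave the column empty. The problem is that every category shares the same class, so I cannot tell the rows apart. I therefore thought I could use td colspan="5" as a marker, so that Python knows where each category starts and ends.
My question: is there a way to scrape row by row and stop when the next td colspan="5" is encountered?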
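The marker idea can be sketched on a made-up HTML fragment, without fetching the real page. The markup below is an assumption that only mimics the structure described above (a header row whose cell has colspan="5" and contains the h3, followed by the credit rows); the real page may differ. The class name CategoryGrouper is hypothetical, and the sketch uses the standard library's html.parser rather than bs4 so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Made-up markup (assumption): header rows carry a td with colspan="5",
# and the rows that follow belong to that header until the next one.
SAMPLE_HTML = """
<table>
  <tr><td colspan="5"><h3 class="clean">Production</h3></td></tr>
  <tr class="devCreditsHighlight"><td>Game A</td><td>Producer</td></tr>
  <tr><td>Game B</td><td>Executive Producer</td></tr>
  <tr><td colspan="5"><h3 class="clean">Design</h3></td></tr>
  <tr class="devCreditsHighlight"><td>Game C</td><td>Lead Designer</td></tr>
</table>
"""


class CategoryGrouper(HTMLParser):
    """Collect row texts under the most recent colspan="5" header row."""

    def __init__(self):
        super().__init__()
        self.groups = {}            # category name -> list of row texts
        self.current = None         # category the parser is currently inside
        self.row_is_header = False  # True once the current row shows the marker
        self.cells = []             # raw cell texts of the row being read
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_is_header = False
            self.cells = []
        elif tag == "td":
            self.in_cell = True
            self.cells.append("")
            if dict(attrs).get("colspan") == "5":
                self.row_is_header = True  # marker: a new category starts here

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr":
            if self.row_is_header:
                # Close the previous category and open a new one
                self.current = self.cells[0].strip()
                self.groups[self.current] = []
            elif self.current is not None:
                row_text = " | ".join(cell.strip() for cell in self.cells)
                self.groups[self.current].append(row_text)


parser = CategoryGrouper()
parser.feed(SAMPLE_HTML)
groups = parser.groups
print(groups)
```

Each category ends up with only its own rows, so the order of the categories on a page and the shared "clean" class no longer matter. The same loop shape (remember the current header, append rows until the next marker) should carry over to iterating over the tr elements with bs4.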
# Current attempt: fetch one developer page and parse it
import bs4 as bs
import urllib.request

gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"
req = urllib.request.Request(gameurl, headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
# Container that holds the developer name and the credit tables
infopage = soup.find_all("div", {"class": "col-md-8 col-lg-8"})
core_list = []
for credits in infopage:
    # Developer name from the page header
    niceHeaderTitle = credits.find_all("h1", {"class": "niceHeaderTitle"})
    name = niceHeaderTitle[0].text
    # All category headings on the page share the same class, "clean"
    Titles = credits.find_all("h3", {"class": "clean"})
    Titles = [title.get_text() for title in Titles]

    # Problem: every branch below reads the first devCreditsHighlight row
    # on the page, not the row that belongs to the category being checked.
    if 'Business' in Titles:
        businessinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        business = businessinfo[0].get_text(strip=True)
    else:
        business = 'none'
    if 'Production' in Titles:
        productioninfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        production = productioninfo[0].get_text(strip=True)
    else:
        production = 'none'
    if 'Design' in Titles:
        designinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        design = designinfo[0].get_text(strip=True)
    else:
        design = 'none'
    if 'Writers' in Titles:
        writersinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        writers = writersinfo[0].get_text(strip=True)
    else:
        writers = 'none'
    if 'Programming/Engineering' in Titles:
        programinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        program = programinfo[0].get_text(strip=True)
    else:
        program = 'none'
    if 'Video/Cinematics' in Titles:
        videoinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        video = videoinfo[0].get_text(strip=True)
    else:
        video = 'none'
    if 'Audio' in Titles:
        Audioinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        audio = Audioinfo[0].get_text(strip=True)
    else:
        audio = 'none'
    if 'Art/Graphics' in Titles:
        artinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        art = artinfo[0].get_text(strip=True)
    else:
        art = 'none'
    if 'Support' in Titles:
        supportinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        support = supportinfo[0].get_text(strip=True)
    else:
        support = 'none'
    if 'Thanks' in Titles:
        thanksinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        thanks = thanksinfo[0].get_text(strip=True)
    else:
        thanks = 'none'

    # One row per page: the name followed by one column per category
    games = [name, business, production, design, writers, video, audio, art, support, program, thanks]
    core_list.append(games)
print(core_list)
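The fixed-column part of the plan (one column per category, 'none' where a page lacks it) can be sketched separately from the scraping. This is a minimal sketch under assumptions: the helper to_row and the found dictionary are hypothetical, and CATEGORIES holds only the ten names that appear in the code above, not the full set of 24.

```python
# Category names taken from the code above; the post mentions 24 in total,
# so this list is incomplete and would need the remaining names added.
CATEGORIES = [
    "Business", "Production", "Design", "Writers",
    "Programming/Engineering", "Video/Cinematics", "Audio",
    "Art/Graphics", "Support", "Thanks",
]


def to_row(name, found, categories=CATEGORIES, missing="none"):
    """Return [name, col1, col2, ...] with exactly one column per category."""
    return [name] + [found.get(category, missing) for category in categories]


# Hypothetical scrape result for one page: only two categories were present.
found = {"Production": "Game A", "Design": "Game C"}
row = to_row("Example Developer", found)
print(row)
```

Because the columns come from one list, every page produces a row of the same length in the same order, whatever categories it contains, and the long chain of if/else blocks is no longer needed.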