Scraping a specific range from a table on a web page

Time: 2020-03-16 01:44:24

Tags: python web-scraping

I am trying to scrape data from this site, which lists game credits broken into different categories. I want to end up with 24 categories across 24 columns in total. On the example page there are five (Production, Design, Engineering, and Thanks).

It would be easy if each category had its own class, but they all share the same h3 class, "clean". Different pages have different categories, and their order also changes from page to page. On top of that, the information I actually need is in the rows that follow, under a different class.

So what I came up with is 24 if statements, one per category, checking whether any h3 with class "clean" matches that category; if so, scrape the class I need, otherwise put nothing. The problem is that they all share the same class. So I thought I could use td colspan="5" as a marker, so that Python knows where each category ends and the next one begins.

My question is: is there a way to scrape until it hits a td colspan="5" and then stop?

import bs4 as bs
import urllib.request


gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"

req = urllib.request.Request(gameurl,headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
infopage = soup.find_all("div", {"class":"col-md-8 col-lg-8"})
core_list = []

for credits in infopage:
        # the page header gives the name; the h3 headings tell which categories are present
        niceHeaderTitle = credits.find_all("h1", {"class":"niceHeaderTitle"})
        name = niceHeaderTitle[0].text

        Titles = credits.find_all("h3", {"class":"clean"})

        Titles = [title.get_text() for title in Titles]

        if 'Business' in Titles:

            businessinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            business = businessinfo[0].get_text(strip=True)


        else:
            business = 'none'


        if 'Production' in Titles:

            productioninfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            production = productioninfo[0].get_text(strip=True)


        else:
            production = 'none'

        if 'Design' in Titles:

            designinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            design = designinfo[0].get_text(strip=True)


        else:
            design = 'none'

        if 'Writers' in Titles:

            writersinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            writers = writersinfo[0].get_text(strip=True)


        else:
            writers = 'none'            

        if 'Programming/Engineering' in Titles:

            programinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            program = programinfo[0].get_text(strip=True)


        else:
            program = 'none'

        if 'Video/Cinematics' in Titles:

            videoinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            video = videoinfo[0].get_text(strip=True)


        else:
            video = 'none'   

        if 'Audio' in Titles:

            Audioinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            audio = Audioinfo[0].get_text(strip=True)


        else:
            audio = 'none' 

        if 'Art/Graphics' in Titles:

            artinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            art = artinfo[0].get_text(strip=True)


        else:
            art = 'none'             


        if 'Support' in Titles:

            supportinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            support = supportinfo[0].get_text(strip=True)


        else:
            support = 'none' 

        if 'Thanks' in Titles:

            thanksinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            thanks = thanksinfo[0].get_text(strip=True)


        else:
            thanks = 'none'             

        games = [name, business, production, design, writers, video, audio, art, support, program, thanks]

        core_list.append(games)            

print(core_list)
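
What I have in mind with the td colspan="5" marker is roughly the sketch below. This is only a rough sketch of the idea, not tested code: it assumes each category title (the h3 with class "clean") sits in a row whose td has colspan="5" in the same credits table, and that the rows I want keep the devCreditsHighlight class from my code above.

import bs4 as bs
import urllib.request

gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"
req = urllib.request.Request(gameurl, headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

credits = soup.find("div", {"class": "col-md-8 col-lg-8"})
categories = {}

# walk each category header, then collect the rows that follow it in
# document order until a row containing <td colspan="5"> marks the
# start of the next category
for header in credits.find_all("h3", {"class": "clean"}):
    title = header.get_text(strip=True)
    rows = []
    row = header.find_next("tr")          # first row after this header
    while row is not None:
        if row.find("td", attrs={"colspan": "5"}):
            break                         # boundary row: this category ends here
        if "devCreditsHighlight" in (row.get("class") or []):
            rows.append(row.get_text(strip=True))
        row = row.find_next("tr")
    categories[title] = rows if rows else ['none']

print(categories)

That way each category would be bounded by the next td colspan="5" row instead of needing a separate if statement per category. Is something along these lines workable, or is there a better way to stop at that boundary row?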

0 Answers:

No answers yet