Scraping the "next page": BeautifulSoup or Scrapy?

Date: 2018-06-02 19:47:43

Tags: loops web-scraping beautifulsoup

I'm working on my first real project. I'm trying to scrape all NFL player news from Rotoworld's news feed:

I have successfully pulled all the information I want from the first page using BeautifulSoup (bs4), but I'm stuck on how to get the information behind the "Older" button. This would be easy if the URL changed each time a new page opened, but it doesn't. I'm wondering whether anyone has tips on scraping the "next page" with BS, or whether I should try a framework like Scrapy instead?

I'm using Python 3. Here is the relevant code:

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    my_url="http://www.rotoworld.com/playernews/nfl/football/"

    # opening up connection, grabbing the page
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs each news report

    containers = page_soup.find_all("div", {"class": "pb"})

    filename = "nfl_player_news.csv"
    f = open(filename, "w")

    headers = "Player, Position, Team, Report, More Info, Date\n"

    f.write(headers)

    for container in containers:
        ugly_player_info = container.div.div.text.strip("\r\n")
        neat_player_info = " ".join(ugly_player_info.split())
        player = container.div.div.a.text
        position = " ".join(neat_player_info.split()[3:4])
        team = " ".join(neat_player_info.split()[5:])

        report = container.p.text.strip()

        more_info = container.find_all("div", {"class": "impact"})
        info = more_info[0].text.strip()

        date_messy = container.find_all("div", {"class": "date"})
        date_time = date_messy[0].text.strip()
        ny_date= " ".join(date_time.split()[0:2])
        date = ny_date + " 2018"

        print("player: " + player)
        print("position: " + position)
        print("team: " + team)
        print("report: " + report)
        print("info: " + info)
        print("date: " + date)

        f.write(player + "," + position + "," + team + "," + report.replace(",", "|") + "," + info.replace(",","|") + "," + date + "\n")

    f.close()

1 Answer:

Answer 0 (score: 0)

Generally, for problems like this you want to look at one of two options.

  1. Use Selenium to open the page, click the "Older" button each time, and grab the new page.
  2. Inspect the page and go to the Network tab. Now click the "Older" button and you will see — in your case — a POST call whose response contains the older page as HTML. You can then parse that response directly.
  3. Additionally, you could use Scrapy, which gives you a lot of the boilerplate up front and speeds up development.
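Option 1 can be sketched roughly as below. This is an untested sketch, not verified against the live site: the container class `pb` comes from the question's code, but the "Older" button's locator (`By.LINK_TEXT, "Older"`) is an assumption that you should confirm in DevTools, and the fixed sleep is a crude stand-in for a proper wait.

```python
import time

from bs4 import BeautifulSoup


def parse_reports(page_html):
    # Same container lookup as the question's code, reduced to plain text.
    page_soup = BeautifulSoup(page_html, "html.parser")
    return [c.get_text(" ", strip=True)
            for c in page_soup.find_all("div", {"class": "pb"})]


def scrape_pages(max_pages=5):
    # Imported here so parse_reports stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("http://www.rotoworld.com/playernews/nfl/football/")
    all_reports = []
    try:
        for _ in range(max_pages):
            all_reports.extend(parse_reports(driver.page_source))
            # The locator is a guess -- inspect the real "Older" button
            # in DevTools and adjust the strategy/text accordingly.
            driver.find_element(By.LINK_TEXT, "Older").click()
            time.sleep(2)  # crude wait; prefer WebDriverWait in real code
    finally:
        driver.quit()
    return all_reports
```

Once `scrape_pages()` returns, each page's containers can be fed through the same field-splitting logic already in the question.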
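For option 2, a common pattern on form-driven sites is to replay the POST yourself with `requests`. The sketch below assumes the pager works through a standard HTML form whose hidden fields must be echoed back; the button field name `ctl00$btnOlder` is purely hypothetical — copy the real name/value pair from the request body shown in the Network tab.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://www.rotoworld.com/playernews/nfl/football/"


def hidden_fields(page_html):
    # Collect every <input type="hidden"> so the POST echoes the page's
    # form state (e.g. __VIEWSTATE on ASP.NET-style sites) back to the server.
    soup = BeautifulSoup(page_html, "html.parser")
    return {inp["name"]: inp.get("value", "")
            for inp in soup.find_all("input", {"type": "hidden"})
            if inp.get("name")}


def fetch_older_page(session):
    # GET the page once, then replay the POST that the "Older" button fires.
    first = session.get(URL)
    payload = hidden_fields(first.text)
    # Hypothetical field name: replace it with the real button's
    # name=value pair from the Network tab's request body.
    payload["ctl00$btnOlder"] = "Older"
    return session.post(URL, data=payload).text
```

The returned HTML can then go straight into the question's existing `BeautifulSoup(..., "html.parser")` pipeline.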