Question

I am making a web crawler for fun. Basically, what I want to do is crawl this page

http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=2010&view=.dateSeason

to get all the home teams first. This is my code:

import requests
from bs4 import BeautifulSoup

def urslit_spider(max_years):
    # Request the results page for each season from 2010 through max_years.
    year = 2010
    while year <= max_years:
        url = 'http://www.premierleague.com/content/premierleague/en-gb/matchday/results.html?paramClubId=ALL&paramComp_8=true&paramSeasonId=' + str(year) + '&view=.dateSeason'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # Print the text of every link marked as a home club.
        for link in soup.findAll('a', {'class' : 'clubs rHome'}):
            lid = link.string
            print(lid)
        year += 1

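I found that the code never enters the for loop. It doesn't give me any error, but it doesn't do anything either. I tried searching for this but couldn't find what is wrong.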

Answer 1

The link you provided redirects me to the home page. After modifying it, I arrive at the URL http://br.premierleague.com/en-gb/matchday/results.html

At this URL, I get all the home team names using

soup.findAll('td', {'class' : 'home'})
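
As a minimal sketch of that lookup, here is the same kind of search run against a small inline HTML string. The markup and team names below are made up for illustration and are an assumption; the real page may be structured differently. It also shows why a loop can be skipped silently: when nothing matches, the result is an empty list, so the loop body never runs and no error is raised.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="home">Arsenal</td><td class="away">Chelsea</td></tr>
  <tr><td class="home">Everton</td><td class="away">Fulham</td></tr>
</table>
"""

def home_teams(html):
    """Return the text of every <td class="home"> cell in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", class_="home")
    return [cell.get_text(strip=True) for cell in cells]

print(home_teams(SAMPLE_HTML))

# No matching tags: the result is an empty list, so a for loop over it
# simply does nothing.
print(home_teams("<p>no results table here</p>"))
```

If the second call describes what you see with the real page, the tags you are searching for are probably not in the HTML that requests receives.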

How did you navigate to the link you provided? Maybe the HTML on that page is different.

Edit: It looks like the content of this site is loaded from this URL: http://br.premierleague.com/pa-services/api/football/lang_en_gb/i18n/competition/fandr/api/gameweek/1.json

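By modifying the URL parameter you can find a lot of information. I still can't open the URL you provided, since it keeps redirecting me, but at the link I provided I can't extract the table information from the HTML (with BeautifulSoup), because the page collects its information from the JSON above.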
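
One way to vary the URL could look like the sketch below. It assumes that the number before `.json` is the gameweek index, which I have not verified, and the helper names `gameweek_url` and `gameweek_urls` are made up for this example.

```python
# Assumption: the trailing number in the path selects the gameweek.
BASE_URL = (
    "http://br.premierleague.com/pa-services/api/football/lang_en_gb/"
    "i18n/competition/fandr/api/gameweek/{}.json"
)

def gameweek_url(gameweek):
    """Build the JSON URL for one gameweek number (hypothetical helper)."""
    return BASE_URL.format(gameweek)

def gameweek_urls(first, last):
    """Build the JSON URLs for gameweeks first..last, inclusive."""
    return [gameweek_url(n) for n in range(first, last + 1)]

for url in gameweek_urls(1, 3):
    print(url)
```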

The best approach is to get the information you need from the JSON. My suggestion is to use the json package in Python.
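
A minimal sketch of that approach with the json package follows. The payload and the field names `matches` and `homeTeam` are invented for illustration, since I am not reproducing the real response here; inspect the actual JSON and adjust the keys accordingly. In practice the text would come from the response body, for example `requests.get(url).text`.

```python
import json

# Hypothetical payload: the keys "matches" and "homeTeam" are assumptions,
# not the real field names used by the site.
SAMPLE_JSON = """
{"matches": [
  {"homeTeam": "Arsenal", "awayTeam": "Chelsea"},
  {"homeTeam": "Everton", "awayTeam": "Fulham"}
]}
"""

def home_teams_from_json(text):
    """Parse a JSON string and return the home team of every match."""
    data = json.loads(text)
    # .get with a default avoids a KeyError if the key is absent.
    return [match["homeTeam"] for match in data.get("matches", [])]

print(home_teams_from_json(SAMPLE_JSON))
```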

If you are not familiar with JSON, you can use this site to make the JSON more readable: https://jsonformatter.curiousconcept.com/
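
If you would rather stay inside Python, the json package can also pretty-print. This is a small sketch; the helper name `pretty` and the sample input are made up for the example.

```python
import json

def pretty(text):
    """Re-serialize a JSON string with indentation so it is easier to read."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True)

# Made-up sample input, just to show the effect.
print(pretty('{"b": [1, 2], "a": {"c": true}}'))
```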

Making a webcrawler - does not go into my for loop
