I'm about a week into programming and am using BeautifulSoup to scrape wrestling metadata from https://cagematch.net.
Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2

link = "https://www.cagematch.net/?id=8&nr=12&page=4"
print link
url = urllib2.urlopen(link)  # Cagematch URL for PWG events
content = url.read()
soup = BeautifulSoup(content)
events = soup.findAll("tr", {"class": "TRow"})  # Captures all events into a list; each event on the site is a '<tr class="TRow">' row

for i in events[1:12]:  # For each event, only searches over a year's scope
    data = i.findAll("td", {"class": "TCol TColSeparator"})  # Captures each cell of an event into a list item, i.e. every '<td class="TCol TColSeparator">'
    date = data[0].text    # Date of show is always the first value of "data"
    show = data[1].text    # Name of show is always the second value of "data"
    status = data[2].text  # Event type: "Event (Card)" means the show hasn't occurred yet, "Event" means it has
    print date, show, status
    if status == "Event":  # If the event has occurred, get card data
        print "Event already taken place"
        link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
        print content
So the idea is: part 1 (scraping the event list) works perfectly, it gets to the site fine and grabs what it needs; part 2 (following the link to an event's card page once it has taken place) doesn't.
I re-declare my "link" variable inside the if statement, and the link variable does change to the correct link. However, when I try to print content again, it is still from the original page from when I first declared link.
It works if I re-declare all the variables, but surely there's another way to do this?
Answer 0 (score 2):
You won't change the page content just by redefining the link variable; you have to request and download the page from the new link:
link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
url = urllib2.urlopen(link)
content = url.read()
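Roughly, slotted into the if branch of your loop (the card_* names are mine, just to keep the card page separate from the listing page you already parsed):

    if status == "Event":  # event has already taken place, so a card page exists
        print "Event already taken place"
        card_link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
        card_url = urllib2.urlopen(card_link)    # request the new page...
        card_content = card_url.read()           # ...and actually download it
        card_soup = BeautifulSoup(card_content)  # parse the card page on its own
        print card_content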
A few other notes:
You are using the outdated BeautifulSoup version 3. Upgrade to BeautifulSoup 4:
pip install beautifulsoup4 --upgrade
and change the import to:
from bs4 import BeautifulSoup
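With bs4 the parsing step barely changes; the main difference is that you should name the parser explicitly. A minimal sketch, assuming content still holds the HTML you already downloaded:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(content, "html.parser")     # explicit parser; "lxml" also works if installed
    events = soup.find_all("tr", {"class": "TRow"})  # bs4 prefers find_all(), though findAll() still works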
You can improve performance by switching to requests and reusing the same session for multiple requests to the same domain, as sketched below.
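A minimal sketch of that idea (the second URL is a hypothetical card-page link, just to show a follow-up request):

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()  # keeps the underlying connection alive between requests to the same host

    response = session.get("https://www.cagematch.net/?id=8&nr=12&page=4")
    soup = BeautifulSoup(response.content, "html.parser")

    # any follow-up request to the same domain reuses the session's connection
    card_response = session.get("https://www.cagematch.net/?id=1&nr=12345")  # hypothetical card URL
    card_soup = BeautifulSoup(card_response.content, "html.parser")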
It is also a good idea to use urljoin() to join the parts of a URL instead of concatenating strings by hand; see the example below.
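For example (the relative href is a made-up value standing in for what you pull out of data[4] in your loop):

    from urlparse import urljoin  # Python 2; on Python 3 it lives in urllib.parse

    base = "https://www.cagematch.net/"
    relative = "?id=1&nr=12345"          # e.g. the href taken from the row (hypothetical value)
    card_link = urljoin(base, relative)  # -> "https://www.cagematch.net/?id=1&nr=12345"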