I'm about a week into programming and am using BeautifulSoup to scrape wrestling metadata from https://cagematch.net.
Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2

link = "https://www.cagematch.net/?id=8&nr=12&page=4"
print link
url = urllib2.urlopen(link)  # Cagematch URL for PWG events
content = url.read()
soup = BeautifulSoup(content)
events = soup.findAll("tr", {"class": "TRow"})  # Captures all events into a list; each event on the site is a '<tr class="TRow">' row

for i in events[1:12]:  # For each event, only searches over a year's scope
    data = i.findAll("td", {"class": "TCol TColSeparator"})  # Captures each cell of an event into a list item, i.e. every '<td class="TCol TColSeparator">'
    date = data[0].text    # Date of show is always the first value of "data"
    show = data[1].text    # Name of show is always the second value of "data"
    status = data[2].text  # Event type: "Event (Card)" means the show hasn't occurred yet, "Event" means it has
    print date, show, status
    if status == "Event":  # If the event has occurred, get card data
        print "Event already taken place"
        link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
        print content
So the idea is: part 1 (scraping the event list) works perfectly, it gets to the site fine and grabs what it needs; part 2 (following the link to an event's card page once it has taken place) doesn't.
I re-declare my "link" variable inside the if statement, and the link variable does change to the correct link. However, when I try to print content again, it is still from the original page from when I first declared link.
It works if I re-declare all the variables, but surely there's another way to do this?
Answer 0 (score 2):
You won't change the page content just by redefining the link variable; you have to request and download the page from the new link:
link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
url = urllib2.urlopen(link)
content = url.read()
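Roughly, slotted into the if branch of your loop (the card_* names are mine, just to keep the card page separate from the listing page you already parsed):

    if status == "Event":  # event has already taken place, so a card page exists
        print "Event already taken place"
        card_link = 'https://cagematch.net/' + data[4].find("a", href=True)['href']
        card_url = urllib2.urlopen(card_link)    # request the new page...
        card_content = card_url.read()           # ...and actually download it
        card_soup = BeautifulSoup(card_content)  # parse the card page on its own
        print card_content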
A few other notes:
You are using the outdated BeautifulSoup version 3. Upgrade to BeautifulSoup 4:
pip install beautifulsoup4 --upgrade
and change the import to:
from bs4 import BeautifulSoup
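With bs4 the parsing step barely changes; the main difference is that you should name the parser explicitly. A minimal sketch, assuming content still holds the HTML you already downloaded:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(content, "html.parser")     # explicit parser; "lxml" also works if installed
    events = soup.find_all("tr", {"class": "TRow"})  # bs4 prefers find_all(), though findAll() still works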
You can improve performance by switching to requests and reusing the same session for multiple requests to the same domain, as sketched below.
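A minimal sketch of that idea (the second URL is a hypothetical card-page link, just to show a follow-up request):

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()  # keeps the underlying connection alive between requests to the same host

    response = session.get("https://www.cagematch.net/?id=8&nr=12&page=4")
    soup = BeautifulSoup(response.content, "html.parser")

    # any follow-up request to the same domain reuses the session's connection
    card_response = session.get("https://www.cagematch.net/?id=1&nr=12345")  # hypothetical card URL
    card_soup = BeautifulSoup(card_response.content, "html.parser")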
It is also a good idea to use urljoin() to join the parts of a URL instead of concatenating strings by hand; see the example below.
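For example (the relative href is a made-up value standing in for what you pull out of data[4] in your loop):

    from urlparse import urljoin  # Python 2; on Python 3 it lives in urllib.parse

    base = "https://www.cagematch.net/"
    relative = "?id=1&nr=12345"          # e.g. the href taken from the row (hypothetical value)
    card_link = urljoin(base, relative)  # -> "https://www.cagematch.net/?id=1&nr=12345"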