Question

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        try:
            urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
        site = urlopen(url)   
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

My code only opens one page for each URL in the file. Sometimes there are more pages, and in that case the following pages use the pattern &page=x

These are the pages I'm talking about:

http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track
http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track&page=7

Answer 1

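You can read the href attribute from the next_page link and append it to the urls list (yes, you should change the parenthesized generator expression to a list so that it can be appended to). It could look something like this: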

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2
import urlparse

with open('urls.txt') as inf:
    urls = [line.strip() for line in inf]
    for url in urls:
        try:
            urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
        site = urlopen(url)   
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

        # Queue the "next page" link, if present, so the loop visits it too.
        next_page = soup.find_all('a', {'class': 'nextlink'})
        if next_page:
            next_page = next_page[0]
            urls.append(urlparse.urljoin(url, next_page['href']))
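
For readers on Python 3, the link-following step can be tried without touching the network. This is only a rough sketch, not the answerer's code: it swaps BeautifulSoup for the standard library's html.parser so it runs without third-party packages, and NextLinkParser, find_next_url, the example.com URL and the sample markup are hypothetical names made up for illustration. Only the nextlink class comes from the answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first <a> whose class list contains 'nextlink'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "nextlink" in classes and attrs.get("href"):
            self.next_href = attrs["href"]


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None if there is none."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # Relative hrefs are resolved against the page they were found on.
    return urljoin(page_url, parser.next_href)


# Hypothetical markup, only to exercise the function.
sample_html = '<a class="nextlink" href="/library/tags?tag=long+track&amp;page=2">Next</a>'
base = "http://example.com/library/tags?tag=long+track"
next_url = find_next_url(base, sample_html)
```

Returning None when no link is found gives the calling loop a natural stopping condition: keep fetching while find_next_url returns a URL.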

Answer 2

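You could build something that collects all the links on a page and follows them, which is what scrapy gives you for free.

You can create a spider that follows every link on a page. Assuming there are pagination links to the other pages, your scraper will follow them automatically.

You can accomplish the same thing by parsing all the links on the page with beautifulsoup, but why do that when scrapy already does it for free?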
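
Following every link by hand, as this answer describes, needs two things that a crawling framework would otherwise manage: a queue of pages still to visit and a set of pages already seen. Here is a rough Python 3 sketch under those assumptions, using only the standard library. It does not show Scrapy's own API; crawl, LinkCollector, the injected fetch callable and the fake pages are all hypothetical names for illustration, and restricting the crawl to links that share the start URL's host and path is an assumption about what counts as "the same listing".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch):
    """Visit start_url and every linked page sharing its host and path.

    fetch is any callable mapping a URL to HTML text (an assumption, so
    the sketch can run offline). Returns the URLs in visit order.
    """
    start = urlsplit(start_url)
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        collector = LinkCollector()
        collector.feed(fetch(url))
        for href in collector.hrefs:
            target = urljoin(url, href)
            parts = urlsplit(target)
            same_listing = (parts.netloc, parts.path) == (start.netloc, start.path)
            if same_listing and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Fake site: two pages of one listing plus an unrelated link.
pages = {
    "http://example.com/tags?tag=a": '<a href="?tag=a&amp;page=2">2</a> <a href="/about">About</a>',
    "http://example.com/tags?tag=a&page=2": '<a href="?tag=a">1</a>',
}
order = crawl("http://example.com/tags?tag=a", pages.__getitem__)
```

The seen set is what keeps the crawl from looping forever when page 2 links back to page 1.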

Answer 3

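I'm not sure I understand your question, but you might consider writing a regular expression that matches your "next" pattern (http://www.tutorialspoint.com/python/python_reg_expressions.htm) and then searching for it among the URLs found on the page. I use this approach when a site's internal links are highly consistent.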
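
As a minimal sketch of that idea in Python 3, assuming the URLs have already been extracted from the page: the pattern below looks for a page=N query parameter like the one in the question. PAGE_PARAM and pagination_urls are hypothetical names, and the page=2 URL in the sample list is invented for illustration (only the page=7 URL appears in the question).

```python
import re

# Matches a page=<digits> query parameter, e.g. "?page=3" or "&page=7".
PAGE_PARAM = re.compile(r"[?&]page=(\d+)(?:[&#]|$)")


def pagination_urls(urls):
    """Return the URLs carrying a page=N parameter, sorted by N, without duplicates."""
    numbered = {}
    for url in urls:
        match = PAGE_PARAM.search(url)
        if match:
            numbered[url] = int(match.group(1))
    return sorted(numbered, key=numbered.get)


# Pretend these hrefs were collected from one page of the listing.
found = [
    "http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track&page=7",
    "http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track&page=2",
    "http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track",
    "http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track&page=2",
]
result = pagination_urls(found)
```

Anchoring the pattern on `[?&]` avoids matching unrelated parameters that merely end in "page", such as a hypothetical `frontpage=1`.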

Is it possible to make the scraper work on additional pages when the site has them?
