Is it possible to open multiple URLs from a .txt file and scrape all the pages at once?

Asked: 2012-11-08 02:09:42

Tags: python file url

I can't work out how to handle multiple URLs. Here is what I've tried so far, but it only scrapes the last URL in the list:

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        line = url 

site = urlopen(url)   

soup = BeautifulSoup(site)

for td in soup.find_all('td', {'class': 'subjectCell'}):
    print td.find('a').text

3 Answers:

Answer 0 (score: 2)

This code should be inside the for loop:

site = urlopen(url)   

soup = BeautifulSoup(site)

for td in soup.find_all('td', {'class': 'subjectCell'}):
    print td.find('a').text

Then it will run once for each URL.
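
For reference, here is the whole script assembled with those lines moved inside the loop. This assembly is mine, not part of the original answer:

from bs4 import BeautifulSoup
from urllib import urlopen  # Python 2; in Python 3 this lives in urllib.request

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        # everything below is now indented into the loop body,
        # so it runs once per URL instead of once after the loop
        site = urlopen(url)
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text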

Answer 1 (score: 0)

If you want to iterate over all the URLs, you have to put the code that processes each URL inside the loop. But you haven't done that. All you have is:

for url in urls:
    line = url

This just reassigns the variables url and line over and over, so that when the loop finishes they both point at the last URL. Then, when you call site = urlopen(url) outside the loop, it runs only on that last URL.
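
To see why, here is a tiny illustration (my addition, not from the original answer): a for-loop variable simply keeps whatever value the final iteration gave it.

# hypothetical URLs, just to show the loop-variable behaviour
for url in ['http://a.example', 'http://b.example', 'http://c.example']:
    line = url  # rebinds line on every pass, discarding the previous value

print url  # prints 'http://c.example' -- only the last assignment survives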

Try this:

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)   
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text
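
As a side note (not part of the original answer), a urls.txt file often contains blank lines, and one dead link would otherwise abort the whole run. A minimal sketch that skips both, still on Python 2:

from bs4 import BeautifulSoup
from urllib import urlopen  # Python 2

with open('urls.txt') as inf:
    for line in inf:
        url = line.strip()
        if not url:  # skip blank lines
            continue
        try:
            site = urlopen(url)
        except IOError as e:  # unreachable host, malformed URL, etc.
            print 'skipping %s: %s' % (url, e)
            continue
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text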

Answer 2 (score: 0)

You need to put all of the per-URL work inside the for loop:

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)   
        soup = BeautifulSoup(site)

        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text
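
If anyone lands here on Python 3 (my note, not from the original answers): urlopen moved to urllib.request, print became a function, and BeautifulSoup accepts an explicit parser name. The same fix would look roughly like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site, 'html.parser')  # naming the parser avoids a warning
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print(td.find('a').text)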