I can't figure out how to handle multiple URLs. Here is what I've tried so far, but it only scrapes the last URL in the list:
from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        line = url

site = urlopen(url)
soup = BeautifulSoup(site)
for td in soup.find_all('td', {'class': 'subjectCell'}):
    print td.find('a').text
Answer 0 (score: 2)
This code should be inside the for loop:
site = urlopen(url)
soup = BeautifulSoup(site)
for td in soup.find_all('td', {'class': 'subjectCell'}):
    print td.find('a').text
Then it will run for each URL.
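As a side note, newer versions of BeautifulSoup emit a warning when no parser is named; naming one explicitly (assuming bs4 4.x with the bundled html.parser) keeps the output clean:

soup = BeautifulSoup(site, 'html.parser')  # name the parser explicitly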
Answer 1 (score: 0)
If you want to loop over all the URLs, you have to put the code that processes each URL inside the loop. But you haven't done that. All you have is:
for url in urls:
    line = url
This just reassigns the variables url and line over and over, leaving both of them pointing at the last URL when the loop finishes. Then, when you call site = urlopen(url) outside the loop, it runs on that last URL only.
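To see the rebinding in isolation, here is a minimal, self-contained sketch (the URLs are made-up placeholders):

urls = ['http://a.example', 'http://b.example', 'http://c.example']
for url in urls:
    line = url        # rebinds url and line on every pass
# after the loop, both names still hold only the final item
print url             # prints http://c.example
print line            # prints http://c.example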
Try this:
with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text
Answer 2 (score: 0)
You need to put all of the per-URL code inside the for loop:
with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text
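For what it's worth, the from urllib import urlopen import above only works on Python 2. A rough Python 3 equivalent of the same loop, assuming the same urls.txt format and page structure, would be:

from urllib.request import urlopen
from bs4 import BeautifulSoup

with open('urls.txt') as inf:
    for line in inf:
        url = line.strip()
        if not url:
            continue                       # skip blank lines
        soup = BeautifulSoup(urlopen(url), 'html.parser')
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            a = td.find('a')
            if a is not None:              # guard against cells with no link
                print(a.text)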