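I want to extract the list of URLs (links) from a web page, but when I run the code below it only prints the URL I supplied in sabah_url_given.txt. sabah_url_collection.txt should instead end up containing links such as "http://www.dailysabah.com/politics/2014/04/30/erdogan-no-uncertainties-ahead-of-presidential-vote", and so on. My sabah_url_given.txt contains "http://www.dailysabah.com/search?query=Erdogan&page=1".
Here is my code: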
from cookielib import CookieJar
import urllib2
from bs4 import BeautifulSoup

# Opener that keeps cookies between requests
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

text_file = open('sabah_url_given.txt', 'r')
try:
    # Each line of the input file is one search-results URL
    for line in text_file:
        print line
        soup = BeautifulSoup(opener.open(line))
        links = soup.select("ul.list")
        # Append whatever was extracted to the collection file
        with open('sabah_url_collection.txt', 'a') as f:
            for link in links:
                f.write(link.get('a href') + '\n')
finally:
    text_file.close()

# Print the collected URLs
url_file = open('sabah_url_collection.txt', 'r')
for line in url_file:
    print line
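
A possible explanation, offered as a sketch rather than a confirmed diagnosis: `soup.select("ul.list")` returns the `<ul>` elements themselves, not the anchors inside them, and `link.get('a href')` looks up an attribute literally named `a href`, which no tag has, so it returns `None`. With BeautifulSoup the usual pattern would be closer to `soup.select("ul.list a")` followed by `link.get('href')`, and each line read from the file probably needs `.strip()` before it is opened, since it still carries its trailing newline. The example below illustrates the same idea in Python 3 using only the standard library, so it runs without a network connection or extra packages. The class `ListLinkParser`, the function `extract_links`, and the `SAMPLE_HTML` markup (including the example.com link) are hypothetical names and data made up for illustration; the real page structure is assumed from the `ul.list` selector in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <ul class="list">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.ul_stack = []  # one bool per open <ul>: True if it has class "list"
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            classes = (attrs.get("class") or "").split()
            self.ul_stack.append("list" in classes)
        elif tag == "a" and any(self.ul_stack) and attrs.get("href"):
            # Resolve relative links against the page they came from
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_stack:
            self.ul_stack.pop()


def extract_links(html, base_url):
    """Return absolute URLs of all links found inside ul.list elements."""
    parser = ListLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup (an assumption, not the real page source)
SAMPLE_HTML = """
<ul class="list">
  <li><a href="/politics/2014/04/30/erdogan-no-uncertainties-ahead-of-presidential-vote">Story</a></li>
  <li><a href="http://example.com/other-story">Other</a></li>
</ul>
<ul class="menu"><li><a href="/about">About</a></li></ul>
"""

if __name__ == "__main__":
    page_url = "http://www.dailysabah.com/search?query=Erdogan&page=1"
    for url in extract_links(SAMPLE_HTML, page_url):
        print(url)
```

The key point is that the anchors are selected and their `href` attribute is read, rather than reading a non-existent attribute from the enclosing list. The link under `ul.menu` is skipped because only anchors inside a `ul` with class `list` are collected.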