Getting a list of all webpage URLs from a newspaper site

Time: 2014-11-15 01:03:07

Tags: python beautifulsoup urllib

I want to extract the list of URLs/links from the corresponding webpage, but when I run the code below it only prints the URL I provided in sabah_url_given.txt. Sabah_url_collection.txt should actually contain entries like "http://www.dailysabah.com/politics/2014/04/30/erdogan-no-uncertainties-ahead-of-presidential-vote", .... This is my sabah_url_given.txt: "http://www.dailysabah.com/search?query=Erdogan&page=1"

Here is my code:

from cookielib import CookieJar
import urllib2
from bs4 import BeautifulSoup

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

try:
    text_file = open('sabah_url_given.txt', 'r')
    for line in text_file:
        url = line.strip()
        print url
        soup = BeautifulSoup(opener.open(url))
        # Select the anchor tags inside the list, not the <ul> element itself
        links = soup.select("ul.list a")
        with open('sabah_url_collection.txt', 'a') as f:
            for link in links:
                # The href attribute is read with .get('href'), not .get('a href')
                f.write(link.get('href') + '\n')
    # Read the collected URLs back once, after the crawl finishes
    with open('sabah_url_collection.txt', 'r') as url_file:
        for line in url_file:
            print line
except urllib2.URLError as e:
    print e
finally:
    text_file.close()
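The core of the problem is the selector: `soup.select("ul.list")` returns the `<ul>` elements, and a `<ul>` tag has no `href` attribute (and `'a href'` is not a valid attribute name). Selecting `"ul.list a"` yields the anchor tags, whose `href` can then be read with `.get('href')`. A minimal offline sketch, using a made-up HTML snippet that mimics the assumed structure of the Daily Sabah search results page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the search page: article links
# sit inside <a> tags under <ul class="list">.
sample_html = """
<ul class="list">
  <li><a href="http://www.dailysabah.com/politics/2014/04/30/erdogan-no-uncertainties-ahead-of-presidential-vote">Article 1</a></li>
  <li><a href="http://www.dailysabah.com/politics/2014/05/01/some-other-article">Article 2</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "ul.list a" selects the anchors themselves; .get('href') reads each URL.
urls = [a.get('href') for a in soup.select("ul.list a")]

for u in urls:
    print(u)
```

The same two-step pattern (descend to the `<a>` tags with a CSS selector, then read `href`) is what the loop in the question needs.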

0 Answers:

No answers yet