BeautifulSoup: extract file links, append only new links on subsequent reruns

Asked: 2013-11-28 11:26:37

Tags: python python-2.7 hyperlink beautifulsoup urllib2

I have the following code to extract links from specific websites.

from bs4 import BeautifulSoup
import urllib2, sys
import re

def jobsinghana():
    site = "http://www.jobsinghana.com/jobs"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    mayday = urllib2.urlopen(req)
    soup = BeautifulSoup(mayday)
    jobs = soup.find_all('a', {'class' : 'hover'})
    print str(jobs).strip('[]')


def modernghana():
    site = "http://www.modernghana.com/GhanaHome/classifieds/list_classifieds.asp?    menu_id=7&sub_menu_id=362&gender=&cat_id=173&has_price=2"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('a', href = re.compile('show_classifieds'))
    for a in jobs:
        header = a.parent.find_previous_sibling('h3').text
        a.string = header
        print a

jobsinghana = jobsinghana()
modernghana = modernghana()


alllinks = open('content.html', 'w')
alllinks.write("\n".join((jobsinghana, modernghana)))
alllinks.close()
  1. The last three lines are supposed to write the extracted links to a file, but instead I get the following error:

    TypeError: sequence item 0: expected string, NoneType found
    
  2. I have also noticed that the code re-extracts every link each time the program runs. Since most of those links were already extracted by earlier runs, I would like to extract only the new links and append just those to the file on subsequent runs.

1 Answer:

Answer 0: (score: 2)

Neither of your functions returns anything. A function without a return statement returns None by default, and that None is what causes your error.
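You can see the failure in isolation (a minimal interactive session, not from the original post):

>>> def scrape():
...     print 'a link'   # prints, but returns nothing
...
>>> result = scrape()    # result is therefore None
a link
>>> '\n'.join((result, result))
Traceback (most recent call last):
  ...
TypeError: sequence item 0: expected string, NoneType found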

Add return statements to your functions instead of printing the results. You are collecting lists of links, so change the code to return those lists; then either concatenate the two lists or write each one to the output file separately:

def jobsinghana():
    site = "http://www.jobsinghana.com/jobs"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    mayday = urllib2.urlopen(req)
    soup = BeautifulSoup(mayday)
    return map(str, soup.find_all('a', {'class' : 'hover'}))


def modernghana():
    site = "http://www.modernghana.com/GhanaHome/classifieds/list_classifieds.asp?    menu_id=7&sub_menu_id=362&gender=&cat_id=173&has_price=2"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('a', href = re.compile('show_classifieds'))
    result = []
    for a in jobs:
        header = a.parent.find_previous_sibling('h3').text
        a.string = header
        result.append(str(a))
    return result

jobsinghana_links = jobsinghana()
modernghana_links = modernghana()


with open('content.html', 'w') as alllinks:
    alllinks.write("\n".join(jobsinghana_links + modernghana_links))

If you need to skip links found on a previous run, you'll have to read them back in first, ideally into a set, so you can test against them while scanning again:

def read_existing():
    with open('content.html') as alllinks:
        return {line.strip() for line in alllinks}

existing = read_existing()
jobsinghana_links = jobsinghana(existing)
modernghana_links = modernghana(existing)
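One caveat worth adding (my note, not part of the original answer): on the very first run content.html does not exist yet, so read_existing() as written would raise an IOError. A guarded variant returns an empty set instead:

def read_existing():
    try:
        with open('content.html') as alllinks:
            return {line.strip() for line in alllinks}
    except IOError:
        # First run: no file yet, so nothing has been collected.
        return set()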

In both scraping functions, add an if link in existing: test to filter out every link that is already present.
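A minimal sketch of how that could look, assuming each function takes the existing set as a parameter and the output file is opened in append mode ('a') so links from earlier runs are kept; jobsinghana() would be modified the same way:

def modernghana(existing):
    site = "http://www.modernghana.com/GhanaHome/classifieds/list_classifieds.asp?menu_id=7&sub_menu_id=362&gender=&cat_id=173&has_price=2"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    soup = BeautifulSoup(urllib2.urlopen(req))
    result = []
    for a in soup.find_all('a', href=re.compile('show_classifieds')):
        a.string = a.parent.find_previous_sibling('h3').text
        link = str(a)
        if link in existing:  # already written on an earlier run, skip it
            continue
        result.append(link)
    return result

existing = read_existing()
new_links = jobsinghana(existing) + modernghana(existing)
if new_links:
    # 'a' appends to the file instead of overwriting it
    with open('content.html', 'a') as alllinks:
        alllinks.write("\n".join(new_links) + "\n")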