I have the following code that extracts links from particular websites.
from bs4 import BeautifulSoup
import urllib2, sys
import re

def jobsinghana():
    site = "http://www.jobsinghana.com/jobs"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    mayday = urllib2.urlopen(req)
    soup = BeautifulSoup(mayday)
    jobs = soup.find_all('a', {'class': 'hover'})
    print str(jobs).strip('[]')

def modernghana():
    site = "http://www.modernghana.com/GhanaHome/classifieds/list_classifieds.asp?menu_id=7&sub_menu_id=362&gender=&cat_id=173&has_price=2"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('a', href=re.compile('show_classifieds'))
    for a in jobs:
        header = a.parent.find_previous_sibling('h3').text
        a.string = header
        print a

jobsinghana = jobsinghana()
modernghana = modernghana()

alllinks = open('content.html', 'w')
alllinks.write("\n".join((jobsinghana, modernghana)))
allinks.close()
The last 3 lines are supposed to write the extracted links to a file, but I get the following error:
TypeError: sequence item 0: expected string, NoneType found
I also noticed that the code re-extracts all the links every time I run the program. Since most of the links were already extracted in earlier runs, I'm interested in extracting only the new links on subsequent runs and appending just those to the file.
Answer 0 (score: 2)
Neither of your functions returns anything, so they both return None by default, which causes your error. Add a return statement to each function instead of printing the results. Since you are collecting lists of links, change the code to return those lists; you can then either concatenate the two lists or write each of them to the output file separately:
def jobsinghana():
    site = "http://www.jobsinghana.com/jobs"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    mayday = urllib2.urlopen(req)
    soup = BeautifulSoup(mayday)
    return map(str, soup.find_all('a', {'class': 'hover'}))

def modernghana():
    site = "http://www.modernghana.com/GhanaHome/classifieds/list_classifieds.asp?menu_id=7&sub_menu_id=362&gender=&cat_id=173&has_price=2"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('a', href=re.compile('show_classifieds'))
    result = []
    for a in jobs:
        header = a.parent.find_previous_sibling('h3').text
        a.string = header
        result.append(str(a))
    return result

jobsinghana_links = jobsinghana()
modernghana_links = modernghana()

with open('content.html', 'w') as alllinks:
    alllinks.write("\n".join(jobsinghana_links + modernghana_links))
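If you prefer the second option and want to write each list separately instead of concatenating, a minimal sketch of the final block (same function names as above) could look like this:

jobsinghana_links = jobsinghana()
modernghana_links = modernghana()

with open('content.html', 'w') as alllinks:
    # Write each list on its own, one link per line.
    for links in (jobsinghana_links, modernghana_links):
        alllinks.write("\n".join(links) + "\n")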
If you need to skip links found on a previous run, you'll have to read the existing links back in first, preferably into a set, so you can test against them cheaply when you scan again:
def read_existing():
    try:
        with open('content.html') as alllinks:
            return {line.strip() for line in alllinks}
    except IOError:
        # No file yet (first run), so there are no links to skip.
        return set()

existing = read_existing()
jobsinghana_links = jobsinghana(existing)
modernghana_links = modernghana(existing)
In the two functions that collect the links, use the existing set to filter out any link that is already in the file, with a test like if link in existing:.
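Putting it together, here is a minimal sketch of the filtered version of jobsinghana, assuming modernghana is adapted the same way and assuming the file is opened in append mode ('a') to cover the "only add new links" part of the question:

def jobsinghana(existing):
    site = "http://www.jobsinghana.com/jobs"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    mayday = urllib2.urlopen(req)
    soup = BeautifulSoup(mayday)
    links = map(str, soup.find_all('a', {'class': 'hover'}))
    # Keep only links that are not already in content.html.
    return [link for link in links if link not in existing]

existing = read_existing()
new_links = jobsinghana(existing) + modernghana(existing)

# Append mode preserves the links written by earlier runs.
with open('content.html', 'a') as alllinks:
    for link in new_links:
        alllinks.write(link + "\n")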