I have the following code to extract certain links from a web page:
from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    print soup.find_all('h2')
The links are contained in 'h2' tags, so I get the links like this:
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
However, I want to get rid of all the 'h2' tags so that I get the links in this form:
<a href="/en/cashiers-accra">cashiers </a>
<a href="/en/cake-baker-accra">Cake baker</a>
<a href="/en/automobile-technician-accra">Automobile Technician</a>
<a href="/en/marketing-officer-accra-4">Marketing Officer</a>
So I updated my code to:
def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('h2')
    for tag in invalid_tag:
        for match in jobs(tag):
            match.replaceWithChildren()
    print jobs
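But I couldn't get it to work, even though this was the best logic I could come up with. I'm a newbie, but I know there must be a better way to do this.
Any help would be gratefully appreciated.
Thanks.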
Answer 0 (score: 1)
You can look at the next element of each <h2> tag:
for h2 in soup.find_all('h2'):
    n = h2.next_element
    if n.name == 'a':
        print n
It produces:
<a href="/en/financial-administrator-accra-1">Financial Administrator</a>
<a href="/en/house-help-accra-17">House help</a>
<a href="/en/office-manager-accra-1">Office Manager </a>
...
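For anyone who wants to try this without fetching the live page, here is a minimal self-contained sketch of the same next_element idea. It assumes Python 3 with bs4 installed and uses the built-in "html.parser" backend. The inline markup is copied from the question, except for the last <h2>, which is a made-up entry added to show that headings without a link are skipped. The function name extract_links is hypothetical and does not come from the original code.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the live page (assumption: the real page
# wraps each job link in an <h2>, as shown in the question).
html = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2>No link here</h2>
"""

def extract_links(markup):
    """Return the <a> tag that immediately follows each <h2> opening tag."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for h2 in soup.find_all("h2"):
        n = h2.next_element
        # next_element can be a plain text node, so read .name defensively
        if getattr(n, "name", None) == "a":
            links.append(n)
    return links

links = extract_links(html)
for a in links:
    print(a)
```

Each item in links is a bs4 Tag, so the URL itself is available as a["href"] if only the address is needed.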